CN116090448A - Entity abbreviation generation method, device, computer equipment and storage medium - Google Patents

Entity abbreviation generation method, device, computer equipment and storage medium

Info

Publication number
CN116090448A
CN116090448A CN202111314465.0A
Authority
CN
China
Prior art keywords
word
entity
component
words
continuous character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111314465.0A
Other languages
Chinese (zh)
Inventor
郑钧
赵旭
张乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111314465.0A priority Critical patent/CN116090448A/en
Publication of CN116090448A publication Critical patent/CN116090448A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention relate to the technical field of natural language processing and disclose a method, an apparatus, a computer device, and a storage medium for generating entity abbreviations. The method comprises the following steps: segmenting a target entity full name based on a preset corpus to obtain the component words that form the target entity full name, wherein the corpus comprises at least one word and the occurrence count of each word; calculating a frequency gain for each component word according to the occurrence counts of the words and a preset rule; and determining the target entity abbreviation according to the frequency gain of each component word. In this manner, embodiments of the invention generate entity abbreviations automatically and simply, with high accuracy and low cost.

Description

Entity abbreviation generation method, device, computer equipment and storage medium
Technical Field
Embodiments of the invention relate to the technical field of natural language processing, and in particular to a method, an apparatus, a computer device, and a storage medium for generating entity abbreviations.
Background
As the visual presentation of intelligent data analysis, data reports display the results of data mining visually. In traditional report-drawing schemes, staff must design and analyze reports for a given business scenario, which takes considerable time and effort. In particular, statistical charts often need to use abbreviations or short forms of the original entity names, and short names are widely used when referring to enterprises, government agencies, and social organizations. As more and more industries adopt big-data applications, the number of entity names grows, and generating reasonable entity abbreviations manually requires substantial effort and cost.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a method, an apparatus, a computer device, and a storage medium for generating entity abbreviations, to solve the prior-art problem that generating entity abbreviations consumes substantial effort and cost.
According to an aspect of embodiments of the present invention, there is provided a method for generating entity abbreviations, the method comprising:
segmenting a target entity full name based on a preset corpus to obtain the component words that form the target entity full name; wherein the corpus comprises at least one word and the occurrence count of each word;
calculating a frequency gain for each component word according to the occurrence counts of the words and a preset rule;
and determining the target entity abbreviation according to the frequency gain of each component word.
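As a rough illustration only, the three claimed steps might be sketched as follows. The lexicon, the greedy longest-match segmentation, the particular gain formula, and the threshold semantics are all assumptions made for this sketch, not the patented formulas:

```python
def generate_abbreviation(full_name, corpus, lexicon, threshold):
    """Toy sketch of the three steps: segment, compute per-word frequency
    gain, keep the words whose gain stays at or above the threshold."""
    # Step 1: segment by greedy longest match against the lexicon
    # (the patent does not specify the matching strategy).
    words, i = [], 0
    while i < len(full_name):
        for j in range(len(full_name), i, -1):
            if full_name[i:j] in lexicon or j == i + 1:
                words.append(full_name[i:j])
                i = j
                break
    # Step 2: frequency gain of a word = change in the entity's
    # word-frequency sum when that word is removed (assumed formula).
    total = sum(corpus.values())
    base = sum(corpus[w] / total for w in words)
    gains = {}
    for w in words:
        rest = [x for x in words if x != w]
        rem_total = total - corpus[w]
        gains[w] = base - sum(corpus[x] / rem_total for x in rest)
    # Step 3: drop the words whose gain falls below the threshold.
    return "".join(w for w in words if gains[w] >= threshold)

lexicon = {"中国移动", "北京", "分公司"}
corpus = {"中国移动": 2, "北京": 1, "上海": 1, "分公司": 2}
print(generate_abbreviation("中国移动北京分公司", corpus, lexicon, 0.05))
```

With this toy corpus the rarely seen word contributes least to the frequency sum, so it is the one eliminated; the real method's elimination rule is described in the detailed description below.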
In an optional manner, before segmenting the target entity full name based on the preset corpus to obtain the component words that form the target entity full name, the method further comprises:
combining every two entity full names in a set entity library to obtain at least one pair of entity full name combinations;
traversing each pair of entity full name combinations to obtain the common continuous character subsets and the specific continuous character subsets in each pair; wherein a common continuous character subset is a character string whose length is greater than a preset length threshold, and a specific continuous character subset is the difference set between one entity full name in the combination and the common continuous character subsets;
performing association analysis on each common continuous character subset to obtain the support of each common continuous character subset;
establishing a word segmentation library based on the common continuous character subsets whose support is greater than a preset support threshold and the specific continuous character subsets;
segmenting the entity full names in the entity library based on the word segmentation library, and generating a corpus from the segmentation results; wherein the corpus comprises at least one word and the occurrence count of each word.
In an optional manner, performing association analysis on each common continuous character subset to obtain the support of each common continuous character subset further comprises:
performing association analysis on each common continuous character subset based on the Apriori algorithm to obtain the support of each common continuous character subset.
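For illustration, support in the Apriori sense is the fraction of transactions that contain a given itemset. Treating each name pair's set of extracted common substrings as one transaction is an assumption made for this sketch:

```python
def support(itemset, transactions):
    """Support = fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if all(item in t for item in itemset))
    return hits / len(transactions)

# Each "transaction" here is the set of common continuous character
# subsets extracted from one entity full name pair (illustrative data).
transactions = [
    {"中国移动", "分公司"},
    {"中国移动", "分公司"},
    {"中国移动"},
]
print(support({"中国移动", "分公司"}, transactions))  # 2/3
```

Subsets whose support exceeds the preset threshold would then be admitted into the word segmentation library.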
In an optional manner, segmenting the entity full names in the entity library based on the word segmentation library and generating a corpus from the segmentation results further comprises:
segmenting each entity full name in the entity library based on the word segmentation library to obtain the words that form each entity full name;
and establishing the corpus from the words that form the entity full names.
In an optional manner, calculating the frequency gain of each component word according to the occurrence counts of the words and a preset rule further comprises:
determining the occurrence count of each component word according to the occurrence counts of the words;
removing the component words one by one, and, according to the occurrence counts of the component words, calculating the word-frequency sum of the entity formed by the component words remaining after each removal;
and calculating the frequency gain of each component word based on a greedy algorithm according to the word-frequency sums.
In an optional manner, after obtaining the occurrence frequency of each component word from the occurrence counts of the words, the method further comprises:
applying logarithmic processing to the occurrence frequency of each component word.
In an optional manner, determining the target entity abbreviation according to the frequency gain of each component word further comprises:
sorting the component words in ascending order of frequency gain;
eliminating combinations of component words in ascending order of frequency gain, and, according to the occurrence counts of the component words, calculating the word-frequency sum of the entity formed by the component words remaining after each combined elimination;
calculating the frequency gain of each component word combination based on a greedy algorithm according to the word-frequency sums;
and constructing the target entity abbreviation based on the component word combinations whose frequency gain is smaller than a preset frequency gain threshold.
According to another aspect of embodiments of the present invention, there is provided an entity abbreviation generating apparatus, comprising:
a word segmentation module, configured to segment a target entity full name based on a preset corpus to obtain the component words that form the target entity full name; wherein the corpus comprises at least one word and the occurrence count of each word;
a calculating module, configured to calculate a frequency gain for each component word according to the occurrence counts of the words and a preset rule;
and a processing module, configured to determine the target entity abbreviation according to the frequency gain of each component word.
In an optional manner, the entity abbreviation generating apparatus further comprises a combination module, an extraction module, an analysis module, and an establishment module, wherein:
the combination module is configured to combine every two entity full names in a set entity library to obtain at least one pair of entity full name combinations;
the extraction module is configured to traverse each pair of entity full name combinations to obtain the common continuous character subsets and the specific continuous character subsets in each pair; wherein a common continuous character subset is a character string whose length is greater than a preset length threshold, and a specific continuous character subset is the difference set between one entity full name in the combination and the common continuous character subsets;
the analysis module is configured to perform association analysis on each common continuous character subset to obtain the support of each common continuous character subset;
the establishment module is configured to establish a word segmentation library based on the common continuous character subsets whose support is greater than a preset support threshold and the specific continuous character subsets;
the word segmentation module is further configured to segment the entity full names in the entity library based on the word segmentation library and generate a corpus from the segmentation results; wherein the corpus comprises at least one word and the occurrence count of each word.
According to another aspect of embodiments of the present invention, there is provided a computer device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus;
the memory is configured to hold at least one executable instruction that causes the processor to:
segment a target entity full name based on a preset corpus to obtain the component words that form the target entity full name; wherein the corpus comprises at least one word and the occurrence count of each word;
calculate a frequency gain for each component word according to the occurrence counts of the words and a preset rule;
and determine the target entity abbreviation according to the frequency gain of each component word.
In an optional manner, the executable instruction further causes the processor to:
combine every two entity full names in a set entity library to obtain at least one pair of entity full name combinations;
traverse each pair of entity full name combinations to obtain the common continuous character subsets and the specific continuous character subsets in each pair; wherein a common continuous character subset is a character string whose length is greater than a preset length threshold, and a specific continuous character subset is the difference set between one entity full name in the combination and the common continuous character subsets;
perform association analysis on each common continuous character subset to obtain the support of each common continuous character subset;
establish a word segmentation library based on the common continuous character subsets whose support is greater than a preset support threshold and the specific continuous character subsets;
segment the entity full names in the entity library based on the word segmentation library, and generate a corpus from the segmentation results; wherein the corpus comprises at least one word and the occurrence count of each word.
According to yet another aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored therein at least one executable instruction that causes a computer device to:
segment a target entity full name based on a preset corpus to obtain the component words that form the target entity full name; wherein the corpus comprises at least one word and the occurrence count of each word;
calculate a frequency gain for each component word according to the occurrence counts of the words and a preset rule;
and determine the target entity abbreviation according to the frequency gain of each component word.
In an optional manner, the executable instruction further causes the computer device to:
combine every two entity full names in a set entity library to obtain at least one pair of entity full name combinations;
traverse each pair of entity full name combinations to obtain the common continuous character subsets and the specific continuous character subsets in each pair; wherein a common continuous character subset is a character string whose length is greater than a preset length threshold, and a specific continuous character subset is the difference set between one entity full name in the combination and the common continuous character subsets;
perform association analysis on each common continuous character subset to obtain the support of each common continuous character subset;
establish a word segmentation library based on the common continuous character subsets whose support is greater than a preset support threshold and the specific continuous character subsets;
segment the entity full names in the entity library based on the word segmentation library, and generate a corpus from the segmentation results; wherein the corpus comprises at least one word and the occurrence count of each word.
Embodiments of the invention quantify how critical a word is through its occurrence count, calculate the frequency gain of each word based on the occurrence counts, and then retain the critical words of the original entity name according to those frequency gains to generate the entity abbreviation, thereby generating entity abbreviations automatically and simply with high accuracy.
The foregoing is only an overview of the technical solutions of the embodiments of the present invention, which may be implemented according to the content of this specification. For a clearer understanding of the technical means of the embodiments, specific embodiments of the present invention are described below.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
Fig. 1 shows a flow chart of an entity abbreviation generation method according to an embodiment of the present invention;
fig. 2 shows a flow chart of an entity abbreviation generation method according to another embodiment of the present invention;
fig. 3 shows a schematic diagram of the distribution of word usage frequencies in an embodiment of the present invention;
fig. 4 shows a schematic process diagram of an entity abbreviation generation method in an embodiment of the present invention;
fig. 5 shows a first schematic structural diagram of an entity abbreviation generating apparatus according to an embodiment of the present invention;
fig. 6 shows a second schematic structural diagram of an entity abbreviation generating apparatus according to an embodiment of the present invention;
fig. 7 shows a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein.
Fig. 1 shows a flow chart of an embodiment of the entity abbreviation generation method of the present invention, which is performed by a computer device. A computer device is a device with computing capability, including but not limited to a terminal device (such as a mobile phone or tablet computer), a wearable smart device (such as a smart watch, smart band, or smart earphones), a smart home device (such as a smart television or smart speaker), an Internet-of-Vehicles device (such as a smart car or in-vehicle terminal), a server, and the like. As shown in fig. 1, the method comprises the following steps:
Step S110: word segmentation is carried out on the target entity full scale based on a preset corpus, so that each component word forming the target entity full scale is obtained; the corpus comprises at least one word and the occurrence frequency corresponding to each word.
The target entity is called as an entity whole for acquiring short, and the entity comprises but is not limited to enterprises, government authorities, social groups and the like. The corpus is established after word segmentation and other processing is carried out based on entity names in the set entity library, and comprises at least one word and the occurrence frequency corresponding to each word.
In an embodiment, before segmenting the target entity full name based on the preset corpus to obtain the component words that form the target entity full name, the method further comprises:
combining every two entity full names in the set entity library to obtain at least one pair of entity full name combinations;
traversing each pair of entity full name combinations to obtain the common continuous character subsets and the specific continuous character subsets in each pair; wherein a common continuous character subset is a character string whose length is greater than a preset length threshold, and a specific continuous character subset is the difference set between one entity full name in the combination and the common continuous character subsets;
performing association analysis on each common continuous character subset to obtain the support of each common continuous character subset;
establishing a word segmentation library based on the common continuous character subsets whose support is greater than a preset support threshold and the specific continuous character subsets;
segmenting the entity full names in the entity library based on the word segmentation library, and generating a corpus from the segmentation results; wherein the corpus comprises at least one word and the occurrence count of each word.
Combining every two entity full names in the set entity library means combining any two entity full names in the library into a pair; for example, if the library contains four entity full names, pairing any two of them yields six entity full name combinations. For each pair, the character strings that occur in both entity full names of the pair and whose length is greater than a preset length threshold are extracted as the common continuous character subsets of that pair; at the same time, a difference-set operation between each entity full name of the pair and the common continuous character subsets yields the character subsets belonging to a single entity full name, which serve as the specific continuous character subsets of that pair. The length threshold may be set according to actual requirements, for example to 1 or 2 characters. Taking a pair consisting of "China Mobile Beijing Branch" (中国移动北京分公司) and "China Mobile Shanghai Branch" (中国移动上海分公司) as an example, after character extraction in the above manner, the common continuous character subsets of the pair are "China Mobile" (中国移动) and "Branch" (分公司), and the specific continuous character subsets are "Beijing" (北京) and "Shanghai" (上海).
Then, association analysis is performed on each common continuous character subset based on the Apriori algorithm to obtain its support; the higher the support of a common continuous character subset, the more often it appears across pairs, so it can serve as a segmentation standard. A word segmentation library can then be established from the common continuous character subsets whose support is greater than a preset support threshold together with the specific continuous character subsets; specifically, the frequently co-occurring common continuous character subsets and the specific continuous character subsets of each pair are added to the word segmentation library, which then serves as the segmentation standard. The support threshold may be set according to actual needs and is not specifically limited here. Segmenting the entity full names in the entity library based on the word segmentation library and generating a corpus from the segmentation results further comprises: segmenting each entity full name in the library based on the word segmentation library to obtain the words that form each entity full name, and establishing the corpus from those words. Note that the segmentation result at least includes the words that form each entity full name.
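The pairwise extraction described above can be sketched as follows. The helpers `common_substrings` and `split_pair` and the maximal-substring filter are illustrative simplifications, not the patented procedure:

```python
def common_substrings(a, b, min_len=2):
    """Contiguous substrings of a, at least min_len long, that also occur
    in b; only maximal ones (not contained in a longer match) are kept."""
    found = set()
    for i in range(len(a)):
        for j in range(i + min_len, len(a) + 1):
            if a[i:j] in b:
                found.add(a[i:j])
    return {s for s in found if not any(s != t and s in t for t in found)}

def split_pair(a, b, min_len=2):
    """Return the common continuous character subsets of the pair (a, b)
    and the specific subsets left over in each name (the difference sets)."""
    common = common_substrings(a, b, min_len)
    def specific(name):
        rest = name
        for c in sorted(common, key=len, reverse=True):
            rest = rest.replace(c, "|")  # cut out the shared pieces
        return [p for p in rest.split("|") if p]
    return common, specific(a) + specific(b)

common, specific = split_pair("中国移动北京分公司", "中国移动上海分公司")
print(common)    # {'中国移动', '分公司'}
print(specific)  # ['北京', '上海']
```

On the China Mobile example above, this reproduces the common subsets "中国移动" and "分公司" and the specific subsets "北京" and "上海".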
After each entity full name in the library has been segmented with the word segmentation library, the words forming each entity full name are obtained, and the occurrence count of each word is obtained from its occurrences across the different entity full names; that is, the sum of a word's occurrences across the entity full names is taken as its occurrence count. Segmenting on the basis of common continuous subsets thus avoids splitting terms apart, and greatly improves both segmentation speed and the soundness of the segmentation results without depending on a public corpus.
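A minimal corpus-building sketch, assuming greedy longest-match segmentation against the word segmentation library (the patent does not specify the matching strategy, so this is an illustrative assumption):

```python
from collections import Counter

def build_corpus(full_names, lexicon):
    """Segment each full name with longest-match against the segmentation
    library and count how often each word occurs across all names."""
    def segment(name):
        words, i = [], 0
        while i < len(name):
            # greedy longest match; fall back to a single character
            for j in range(len(name), i, -1):
                if name[i:j] in lexicon or j == i + 1:
                    words.append(name[i:j])
                    i = j
                    break
        return words

    corpus = Counter()
    for name in full_names:
        corpus.update(segment(name))
    return corpus

lexicon = {"中国移动", "分公司", "北京", "上海"}
corpus = build_corpus(["中国移动北京分公司", "中国移动上海分公司"], lexicon)
print(corpus["中国移动"])  # 2
```

The resulting `Counter` plays the role of the corpus: each key is a word and each value its occurrence count.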
Step S120: calculate the frequency gain of each component word according to the occurrence counts of the words and a preset rule.
Calculating the frequency gain of each component word according to the occurrence counts of the words and a preset rule further comprises:
determining the occurrence count of each component word according to the occurrence counts of the words;
removing the component words one by one, and, according to the occurrence counts of the component words, calculating the word-frequency sum of the entity formed by the component words remaining after each removal;
and calculating the frequency gain of each component word based on a greedy algorithm according to the word-frequency sums.
Since the corpus records the occurrence count of each word, once the target entity full name has been segmented against the corpus, the occurrence count of each of its component words is known. Before any component word is removed, the frequency of each component word can be calculated from these counts: for each component word, its occurrence count is divided by the sum of the occurrence counts of the words, and the word-frequency sum of the target entity full name is then the sum of the frequencies of its component words. When the component words are removed one by one, the frequency of each remaining component word is its occurrence count divided by a target count, where the target count is the sum of the occurrence counts of the words minus the occurrence count of the currently removed component word; the word-frequency sum of the entity formed by the remaining component words is then the sum of their frequencies.
After each component word has been removed in turn, the frequency gain of each component word can be calculated based on a greedy algorithm from the word-frequency sum of the entity formed by the remaining component words and the word-frequency sum of the full target entity. In this way, the criticality of a word is quantified by its occurrence count and the frequency gain is computed from those counts, which is both fast and sound.
In an embodiment, after obtaining the occurrence frequency of each component word from the occurrence counts of the words, the method further comprises applying logarithmic processing to the occurrence frequency of each component word. Owing to the nature of everyday language use, word usage frequencies typically follow a power-law distribution: a small number of words appear very often, while most words are used rarely. To reduce the influence of this power-law characteristic on the calculation, the occurrence frequency of each component word is processed logarithmically.
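A sketch of the log-damped frequency-sum calculation; the `log(1 + count)` form and the normalisation by the remaining total are assumptions based on the description, not the patented formula:

```python
import math

def log_freq_sum(words, corpus, removed=()):
    """Word-frequency sum with log-damped counts, to mitigate the
    power-law distribution of word usage (assumed formula)."""
    rest = [w for w in words if w not in removed]
    total = sum(corpus.values()) - sum(corpus[w] for w in removed)
    return sum(math.log(1 + corpus[w]) / total for w in rest)

corpus = {"中国移动": 2, "北京": 1, "上海": 1, "分公司": 2}
words = ["中国移动", "北京", "分公司"]
base = log_freq_sum(words, corpus)
# gain of removing 北京 = change in the word-frequency sum
gain_beijing = base - log_freq_sum(words, corpus, removed=("北京",))
```

Damping the counts keeps one very frequent word (such as a generic suffix like 公司) from dominating the frequency sums and hence the gains.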
Step S130: determine the target entity abbreviation according to the frequency gain of each component word.
Specifically: sort the component words in ascending order of frequency gain; eliminate combinations of component words in ascending order of frequency gain, and, according to the occurrence counts of the component words, calculate the word-frequency sum of the entity formed by the component words remaining after each combined elimination; calculate the frequency gain of each component word combination based on a greedy algorithm from the word-frequency sums; and construct the target entity abbreviation based on the component word combinations whose frequency gain is smaller than a preset frequency gain threshold.
Here, the component words may be sorted in ascending order of frequency gain, i.e. words with a small frequency gain first and words with a large frequency gain last. Combination elimination then starts from the component words with the smaller gains, removing, for example, groups of 2 or 3 component words at a time. After each combined elimination, the occurrence frequency of every remaining component word is recomputed: the word's occurrence count is divided by a target count, where the target count is the sum of the occurrence counts of all words minus the counts of the currently eliminated component words. Summing these frequencies gives the word-occurrence-frequency sum of the entity formed by the remaining component words. From this sum and the word-occurrence-frequency sum of the full target entity name, the frequency gain of each component-word combination can be calculated with a greedy algorithm.
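One plausible reading of the selection step can be sketched under the same assumed -log scaling (the patent's gain formula and threshold semantics survive only as images): component words whose individual removal gain stays below the threshold are treated as keywords and retained, and the retained words are re-joined in their original order.

```python
import math

def h(words, counts):
    """Sum of log-scaled occurrence frequencies over an entity's words."""
    total = sum(counts.values())
    return sum(-math.log(counts[w] / total) for w in words)

def abbreviate(entity_words, counts, eps):
    """Keep the component words whose removal gain is below eps (removing
    them would hurt the sum the most, i.e. keywords) and re-join them in
    original order as the abbreviation."""
    base = h(entity_words, counts)
    def gain(word):
        rest = [w for w in entity_words if w != word]
        shrunk = {k: v for k, v in counts.items() if k != word}
        return h(rest, shrunk) - base
    return "".join(w for w in entity_words if gain(w) < eps)

counts = {"china": 50, "mobile": 40, "suzhou": 5, "cloud": 3, "center": 2}
short_name = abbreviate(["china", "mobile", "suzhou", "cloud"], counts, -3.0)
```

With these illustrative counts the frequent, generic words "china" and "mobile" are dropped while the rare, specific words survive.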
In summary, the entity-abbreviation generation method of the above embodiment quantifies the criticality of each word by its occurrence count, calculates the corresponding frequency gain from those counts, and retains the key words of the original entity name according to that gain to generate the entity abbreviation, so that accurate entity abbreviations are generated automatically, simply and quickly.
The foregoing embodiments are described below through a specific example based on the same inventive concept; in this example the frequency gain is referred to as a probability gain.
Among the component words of an entity's full name there are two kinds of words, keywords and non-keywords. The keywords generally carry the core of the full name, and content composed of the keywords can to some extent stand in for the full name, so the full name can be shortened into an entity abbreviation. In general, frequently repeated words are well known and belong to common words, while some words appear rarely but express critical meaning and belong to specific words. Based on this idea, the present application proposes a method for generating entity abbreviations based on a probability-gain model: after the entity name is segmented into words, probability gains are calculated, and the optimal entity abbreviation is obtained by maximizing the probability gain. In the text-preprocessing part of traditional keyword-extraction schemes, the fine granularity of segmentation is hard to control with conventional word-segmentation techniques; for example, "Beijing university" is divided into "Beijing" and "university", and the existing remedy is to special-case well-known names. Because institutional entity names are mostly proper nouns strongly related to specific fields, a name like "cloud capability center of China Mobile Suzhou" will, with high probability, have "cloud capability center" split into "cloud capability" and "center" by conventional text preprocessing. A threshold therefore needs to be set during data preprocessing to filter out such splits of proper nouns. For the segmentation itself, the prior art mainly relies on hidden-Markov word segmentation trained on massive text collections, whereas segmentation here targets an institution entity-name library.
Because the text collection consists mainly of domain terms and is therefore somewhat specialized, the invention further provides a word-segmentation scheme based on common contiguous subsets, which greatly improves segmentation training speed and the reasonableness of segmentation results without depending on a public corpus. Fig. 2 shows a flowchart of an embodiment of the entity-abbreviation generation method of the present invention; as shown in fig. 2, the method comprises the following steps:
Step S210: acquiring an entity name, and segmenting the entity name;
step S220: initializing an entity word corpus, and counting word occurrence frequency;
step S230: traversing each word of the target entity name;
step S240: calculating the frequency gain of each word segmentation and updating the entity word corpus;
step S250: and selecting the maximum frequency gain, and re-splicing the words to obtain the entity abbreviation.
The specific process is as follows:
step1, acquiring all entity names in the entity tables in a database to obtain an entity name set C = {c1, c2, …, cn};
Step2, preprocessing entity name text data;
step2.1, performing word segmentation on each entity full name ci to obtain the corresponding set of entity component words ci = {s_ci^1, s_ci^2, …, s_ci^m}.
Since entity names are mostly proper nouns and strongly related to specific fields, a threshold needs to be set during data preprocessing to filter out splits of the proper nouns. Specifically, for the entity name set {c1, c2, …, cn}, the entity names are combined in pairs {(c1, c2), …, (cn-1, cn)}, so that the number of combinations is n(n-1)/2. Traversing the pairwise combinations {(c1, c2), …, (cn-1, cn)}, the common contiguous subsets with string length greater than 1 are calculated for each pair, e.g. com(c1, c2) = {w1, w2, …, wk}, where k is the number of common contiguous subsets whose string length is greater than 1 and com(·, ·) denotes the common-contiguous-subset operation on two entity name strings.
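A minimal sketch of the common-contiguous-subset operation (the function name and the handling of non-maximal substrings are assumptions): every substring of length greater than 1 shared by both name strings is collected.

```python
def common_contiguous_subsets(a, b, min_len=2):
    """All substrings of `a` with at least `min_len` characters that also
    occur contiguously in `b`."""
    subs = {a[i:j] for i in range(len(a))
            for j in range(i + min_len, len(a) + 1)}
    return {s for s in subs if s in b}

shared = common_contiguous_subsets("cloudcenter", "datacenter")
# "center" (and its sub-fragments) are shared; "cloud" is not.
```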
The specific contiguous subsets of the two entity names are then obtained by difference-set operations, c1 \ com(c1, c2) = {a1, …, au} and c2 \ com(c1, c2) = {b1, …, bt}, where u and t are the numbers of specific contiguous subsets of c1 and c2 respectively, and c1 \ com(c1, c2) denotes the difference-set operation between the entity name string c1 and the common contiguous subsets.
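The difference-set step can be sketched as deleting every shared fragment from one name and keeping the contiguous pieces that survive (the sentinel-based splitting is an implementation assumption):

```python
def specific_subsets(name, common):
    """Remove each common contiguous subset (longest first, so short
    fragments of an already-removed string do not cut the remainder) and
    return the contiguous pieces that are left."""
    for s in sorted(common, key=len, reverse=True):
        name = name.replace(s, "\x00")
    return [piece for piece in name.split("\x00") if piece]

left = specific_subsets("cloudcenter", {"center", "en", "ce"})
# Only the name-specific fragment survives.
```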
Over the set of pairwise entity full-name combinations, relevance analysis is carried out on the common contiguous subsets of all combinations with the Apriori algorithm, and the support of each common contiguous subset is calculated; the support of a common contiguous subset wi can be computed as support(wi) = (number of pairwise combinations whose common contiguous subsets contain wi) / (n(n-1)/2). The higher the support of a common contiguous subset, the higher the probability that this part appears simultaneously in pairs of names, so it can serve as a word-segmentation standard; the scheme does not take sporadic common contiguous subsets as segmentation standards. The common contiguous subsets whose support is greater than a certain threshold σ are selected, and together with the specific contiguous subsets of all entity names they form the word library D.
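The support computation can be sketched in the usual Apriori sense, as the fraction of name pairs whose common contiguous subsets contain a given fragment; the exact formula exists only as an image, so this denominator is an assumption.

```python
from itertools import combinations

def common_contiguous_subsets(a, b, min_len=2):
    """Substrings of at least min_len characters shared by both strings."""
    subs = {a[i:j] for i in range(len(a))
            for j in range(i + min_len, len(a) + 1)}
    return {s for s in subs if s in b}

def subset_support(names):
    """support(w) = (# of pairwise combinations whose common contiguous
    subsets include w) / (n * (n - 1) / 2)."""
    pairs = list(combinations(names, 2))
    hits = {}
    for a, b in pairs:
        for s in common_contiguous_subsets(a, b):
            hits[s] = hits.get(s, 0) + 1
    return {s: c / len(pairs) for s, c in hits.items()}

support = subset_support(["abcenter", "xycenter", "qqcenter"])
# "center" is common to every pair, so its support is 1.0.
```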
Traversing the entity full-name set C = {c1, c2, …, cn}, each full-name string is segmented against the word library D to obtain the entity component words c1 = {s_c1^1, s_c1^2, …}, where s_c1^i ∈ c1.
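Segmenting a full-name string against the word library can be sketched as a greedy longest-match scan (an assumed matching strategy; the patent does not spell out the matching order):

```python
def segment(name, lexicon):
    """Greedy longest-match: at each position take the longest library
    word that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(name):
        for j in range(len(name), i, -1):
            if name[i:j] in lexicon or j == i + 1:
                words.append(name[i:j])
                i = j
                break
    return words

parts = segment("cloudcapabilitycenter", {"cloud", "capability", "center"})
```

The single-character fallback guarantees the scan always advances even when a stretch of the name is absent from the library.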
Step2.2, constructing a corpus of entity words S = {s1, s2, …, sm}, where the si are the distinct component words produced by segmentation and m is their number.
Step2.3, calculate the word occurrence frequencies of the entity name set {c1, c2, …, cn}. The occurrence count fi of each entity word si is counted, and the occurrence frequency of each entity word is calculated as pi = fi / (f1 + f2 + … + fm), with pi ∈ (0, 1). Due to the nature of words in everyday use, word frequency typically follows a power-law distribution, as shown in fig. 3: a small number of words appear frequently in daily use, while most words are used rarely. To reduce the influence of the power-law distribution on the calculation, the occurrence frequency pi is processed logarithmically in this embodiment, for example as pi* = -log(pi), yielding the word frequency set P* = {p1*, p2*, …, pm*}.
Step3, according to the word frequency set P* from Step2, calculate the word occurrence probability sum hx of the target entity name, where hx is calculated as hx = Σ pi*, the sum running over the component words si of the target entity name cx.
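Step3 can be sketched as below, with the same assumed -log stand-in for the image-only log formula:

```python
import math
from collections import Counter

def word_frequencies(segmented_names):
    """p_i for every word across the segmented entity-name corpus."""
    counts = Counter(w for name in segmented_names for w in name)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def h_x(target_words, freqs):
    """Word occurrence probability sum of one target name, using
    log-processed frequencies p_i* = -log(p_i)."""
    return sum(-math.log(freqs[w]) for w in target_words)

freqs = word_frequencies([["cloud", "center"], ["cloud", "data"]])
hx = h_x(["cloud", "center"], freqs)
```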
step4, word segmentation set according to traversing target entity names
Figure BDA0003343153860000134
The following operations are performed.
Step4.1, one by one combining words of entity names
Figure BDA0003343153860000135
Reject from the word segmentation set of the target entity name and repeat steps step2.2 and step2.3.
Step4.2, calculate the word occurrence probability sum hi obtained after the component word s_cx^i of the entity name has been removed. On the one hand, hi decreases because the removed word's own probability term is deleted; on the other hand, removing a component word with a higher occurrence count reduces the total word count, so the occurrence frequencies pj of the remaining entity words increase and the lift is greater.
Step4.3, combinations of multiple redundant words. When calculating the probability gain of a single word, an implementation that traverses the entity name and removes words one by one is adopted; for removing combinations of words, however, this embodiment, because of the computational complexity, calculates the probability gain of the entity component words by a combination-removal optimization based on a greedy algorithm, with the probability gain calculated as Δhi = hi - hx, where hi denotes the word occurrence probability sum after the corresponding removal. The probability gains are sorted to obtain the probability gain set ΔH, which is stored.
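Steps 4.3-4.4 can be sketched by scoring every small group removal and sorting the gains into ΔH; the group-size limit, the gain form Δh = hi − hx, and all names are assumptions over the image-only formulas.

```python
import math
from itertools import combinations

def h(words, counts):
    """Sum of log-scaled occurrence frequencies over an entity's words."""
    total = sum(counts.values())
    return sum(-math.log(counts[w] / total) for w in words)

def gain_set(entity_words, counts, max_k=2):
    """Delta-H: gain of striking out each group of up to max_k component
    words at once, sorted ascending (most negative first)."""
    base = h(entity_words, counts)
    gains = []
    for k in range(1, max_k + 1):
        for group in combinations(entity_words, k):
            rest = [w for w in entity_words if w not in group]
            shrunk = {w: c for w, c in counts.items() if w not in group}
            gains.append((h(rest, shrunk) - base, group))
    return sorted(gains)

counts = {"china": 50, "mobile": 40, "suzhou": 5, "cloud": 3, "center": 2}
delta_h = gain_set(["china", "mobile", "suzhou", "cloud"], counts)
# The most negative entry is the group whose removal hurts the sum most,
# i.e. the group of the rarest (most keyword-like) words.
```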
Step4.4, remove word combinations in ascending order of probability gain in the set ΔH, and store the entity component words whose probability gain is smaller than a certain threshold ε, obtaining the component word set C* = {c1, c2, …, cn} - {cα, …, cγ}.
Step5, reconstruct the entity abbreviation. The component word set C* from Step4 is combined and reconstructed into the entity abbreviation, which is then output. Repeating the above steps generates the abbreviations of all entity names. FIG. 4 shows a process diagram of an embodiment of the entity-abbreviation generation method of the present invention; as shown in FIG. 4, the target entity name "cloud capability center of China Mobile Suzhou" is processed by the entity-abbreviation generation method to obtain the corresponding entity abbreviation "cloud capability center".
In summary, the entity-abbreviation generation method of the above embodiment segments the entity names to build a corpus, defines the criticality of a word by its statistical occurrence frequency, calculates probability gains iteratively, retains the key words of the original entity name, and generates the entity abbreviation by word reconstruction. Compared with the prior art, the method has the following advantages: various entity names can be simplified automatically; no additional business logic or business information is required; the entity names are segmented into a corpus and the criticality of each word is quantified by its occurrence probability in the corpus; and the abbreviation is generated by word reconstruction, needing no training set, so the method belongs to unsupervised learning.
Fig. 5 shows a schematic structural diagram of an embodiment of the entity-abbreviation generating device of the present invention. As shown in fig. 5, the apparatus 300 includes: a word segmentation module 310, a calculation module 320, and a processing module 330; wherein:
the word segmentation module 310 is configured to segment the target entity full scale based on a preset corpus, so as to obtain each component word that forms the target entity full scale; wherein the corpus comprises at least one word and the occurrence times corresponding to each word;
The calculating module 320 is configured to calculate, according to a preset rule, a frequency gain corresponding to each of the component words according to the number of occurrences corresponding to each of the words;
and the processing module 330 is configured to determine the target entity abbreviation according to the frequency gain corresponding to each of the component words.
The target entity full name is the entity full name for which an abbreviation is to be acquired; entities include but are not limited to enterprises, government authorities, social groups, and the like. The corpus is established after word segmentation and related processing of the entity names in a set entity library, and includes at least one word and the occurrence count of each word.
In an alternative manner, the computing module 320 is specifically configured to: determining the occurrence times corresponding to the component words according to the occurrence times corresponding to the words; removing each component word one by one, and respectively calculating word occurrence frequency sum corresponding to entities formed by the remaining component words after removal according to the occurrence times corresponding to each component word; and respectively calculating the frequency gain of each component word based on a greedy algorithm according to the word occurrence frequency sum.
Here, since the corpus includes the occurrence count of each word, after the target entity full name is segmented against the corpus, the occurrence count of each component word composing the full name can be obtained accordingly. Before any component word is removed, the occurrence frequency of each component word is its occurrence count divided by the sum of the occurrence counts of all words; summing these frequencies gives the word-occurrence-frequency sum of the full target entity name, i.e. the sum over its component words. When the component words are removed one by one, the occurrence frequency of each remaining component word is its occurrence count divided by a target count, where the target count is the sum of the occurrence counts of all words minus the count of the currently removed component word; summing these frequencies gives the word-occurrence-frequency sum of the entity formed by the remaining component words.
After the component words are removed one by one, the frequency gain of each component word can be calculated with a greedy algorithm, from the word-occurrence-frequency sum of the entity formed by the remaining component words after removal and the word-occurrence-frequency sum of the full target entity name. In this way the criticality of a word is quantified by its occurrence count and the frequency gain is computed from those counts, which is both fast and reasonable.
In an embodiment, the calculating module 320 is further configured to perform logarithmic processing on the occurrence frequency of each component word after obtaining those frequencies from the occurrence counts of the words. It will be appreciated that, owing to the nature of words in everyday use, word frequency typically follows a power-law distribution: a small number of words appear constantly in daily use, while most words are used very infrequently. To reduce the influence of this power-law characteristic on the calculation, the occurrence frequency of each component word is processed logarithmically.
In one embodiment, the processing module 330 is specifically configured to: sort the component words in ascending order of frequency gain; eliminate combinations of component words in ascending order of frequency gain and, for each elimination, calculate the word-occurrence-frequency sum of the entity formed by the remaining component words from the occurrence counts of the component words; calculate the frequency gain of each component-word combination with a greedy algorithm according to those sums; and construct the target entity abbreviation from the component-word combination whose frequency gain is smaller than a preset frequency-gain threshold. Here, the component words may be sorted in ascending order of frequency gain, i.e. words with a small frequency gain first and words with a large frequency gain last.
Combination elimination may start from the component words with the smaller frequency gains, removing, for example, groups of 2 or 3 component words at a time. After each combined elimination, the occurrence frequency of every remaining component word is recomputed: the word's occurrence count is divided by a target count, where the target count is the sum of the occurrence counts of all words minus the counts of the currently eliminated component words. Summing these frequencies gives the word-occurrence-frequency sum of the entity formed by the remaining component words. From this sum and the word-occurrence-frequency sum of the full target entity name, the frequency gain of each component-word combination can be calculated with a greedy algorithm.
In an alternative manner, as shown in fig. 6, the entity-abbreviation generating device further includes: a combination module 340, an extraction module 350, an analysis module 360, and a library-building module 370; wherein:
the combination module 340 is configured to combine the full scales of each two entities in the set entity library respectively to obtain at least one pair of full scale combinations of the entities;
an extracting module 350, configured to traverse each pair of the entity full-scale combinations and obtain a common continuous character subset and a specific continuous character subset in each pair of the entity full-scale combinations; the common continuous character subset is a character string with the length being greater than a preset length threshold, and the specific continuous character subset is a difference set between each entity full name in the entity full name combination and the common continuous character subset;
an analysis module 360, configured to perform a relevance analysis on each of the common continuous character subsets, so as to obtain a support degree of each of the common continuous character subsets;
the library-building module 370 is configured to establish a word segmentation library based on the common continuous character subsets whose support is greater than a preset support threshold and on each specific continuous character subset;
the word segmentation module 310 is further configured to segment the entity names in the entity library based on the word segmentation library, and generate a corpus according to the word segmentation result; the corpus comprises at least one word and the occurrence frequency corresponding to each word.
Combining every two entity full names in the set entity library means combining any two full names in the library into a pair; for example, if the entity library contains four entity full names, combining any two of them yields six pairs of full-name combinations. For each pair, the character strings that occur in both full names of the pair and whose length is greater than a preset length threshold are extracted as the common continuous character subsets of that combination; at the same time, a difference-set operation between each full name of the pair and its common continuous character subsets yields the character subsets belonging to that single full name, which serve as the specific continuous character subsets of the combination. The length threshold may be set according to actual requirements, for example to 1 or 2 characters. Taking a pair consisting of "China Mobile Beijing division company" and "China Mobile Shanghai division company" as an example, after character extraction in the above manner, the common continuous character subsets of the combination are "China Mobile" and "division company", and the specific continuous character subsets are "Beijing" and "Shanghai".
Then, relevance analysis is performed on each common continuous character subset based on the Apriori algorithm to obtain its support; the higher the support of a common continuous character subset, the higher the probability that it appears in pairs, so it can serve as a word-segmentation standard. A word segmentation library can then be established from the common continuous character subsets whose support exceeds a preset support threshold together with each specific continuous character subset; specifically, the common subsets with a high probability of paired occurrence and the specific subsets of every full-name combination are added to the library, which is then used as the segmentation standard. The support threshold may be set according to actual needs and is not specifically limited here. Here, the word segmentation module 310 is specifically configured to: segment all entity full names in the entity library against the word segmentation library to obtain the words composing each full name, and establish the corpus from those words. The word-segmentation result contains at least the words composing each entity full name. After segmentation, the occurrence count of each word is obtained from its occurrences across the different full names, i.e. the sum of a word's occurrence counts over the different entity full names is taken as that word's occurrence count.
Therefore, word segmentation based on common continuous subsets avoids splitting proper terms and greatly improves segmentation speed and the reasonableness of segmentation results without depending on a public corpus.
In summary, the entity-abbreviation generating device of the above embodiment quantifies the criticality of each word by its occurrence count, calculates the corresponding frequency gain from those counts, and retains the key words of the original entity name according to that gain to generate the entity abbreviation, so that accurate entity abbreviations are generated automatically, simply and quickly.
FIG. 7 is a schematic diagram of an embodiment of a computer device according to the present invention; the specific implementation of the computer device is not limited by the embodiments of the present invention.
As shown in fig. 7, the computer device may include: a processor 402, a communication interface (Communications Interface) 404, a memory 406, and a communication bus 408.
Wherein: processor 402, communication interface 404, and memory 406 communicate with each other via communication bus 408. A communication interface 404 for communicating with network elements of other devices, such as clients or other servers. The processor 402 is configured to execute the program 410, and may specifically perform the relevant steps in the embodiment of the generating method for entity abbreviation.
In particular, program 410 may include program code including computer-executable instructions.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the computer device may be processors of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
Memory 406 is for storing the program 410. Memory 406 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
Program 410 may be specifically invoked by processor 402 to cause a computer device to:
word segmentation is carried out on the target entity full scale based on a preset corpus, so that each component word forming the target entity full scale is obtained; wherein the corpus comprises at least one word and the occurrence times corresponding to each word;
calculating the frequency gain corresponding to each component word according to the occurrence times corresponding to each word and a preset rule;
and determining the target entity abbreviation according to the frequency gain corresponding to each component word.
In an alternative, the program 410 is invoked by the processor 402 to cause a computer device to:
combining each two entity scales in the set entity library respectively to obtain at least one pair of entity scale combinations;
traversing each pair of entity full scale combinations to obtain a public continuous character subset and a special continuous character subset in each pair of entity full scale combinations; the common continuous character subset is a character string with the length being greater than a preset length threshold, and the specific continuous character subset is a difference set between each entity full name in the entity full name combination and the common continuous character subset;
carrying out relevance analysis on each public continuous character subset to obtain the support degree of each public continuous character subset;
establishing a word segmentation library based on the public continuous character subsets with the support degree larger than a preset support degree threshold value and the specific continuous character subsets;
performing word segmentation on the entity full names in the entity library based on the word segmentation library, and generating a corpus according to word segmentation results; the corpus comprises at least one word and the occurrence frequency corresponding to each word.
In an alternative, the program 410 is invoked by the processor 402 to cause a computer device to:
And carrying out relevance analysis on each public continuous character subset based on an Apriori algorithm to obtain the support degree of each public continuous character subset.
In an alternative, the program 410 is invoked by the processor 402 to cause a computer device to:
dividing words of all entity names in the entity library based on the word dividing library to respectively obtain words forming all the entity names;
and establishing a corpus according to the words which form the full names of the entities.
In an alternative, the program 410 is invoked by the processor 402 to cause a computer device to:
determining the number of occurrences of each component word from the number of occurrences of each word;
removing the component words one by one and, for each removal, calculating the word-frequency sum of the entity formed by the remaining component words, based on the number of occurrences of each component word;
and calculating the frequency gain of each component word from these word-frequency sums based on a greedy algorithm.
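One way to read the leave-one-out rule above, combined with the logarithmic processing the method also describes: log-scale each component word's count, and take a word's frequency gain to be the drop in the full name's frequency sum when that word is removed. Under this reading the gain reduces to the word's own log count; the counts below are invented:

```python
import math

def frequency_gains(words: list[str], counts: dict[str, int]) -> dict[str, float]:
    """Leave-one-out sketch: log each word's occurrence count, then
    gain(w) = (full frequency sum) - (sum after removing w)."""
    logs = {w: math.log(counts[w]) for w in words}
    total = sum(logs.values())
    gains = {}
    for w in words:
        remaining = total - logs[w]   # frequency sum of the remaining words
        gains[w] = total - remaining  # drop caused by removing w
    return gains

counts = {"China Mobile": 120, "Suzhou": 15, "Software": 40}
print(frequency_gains(["China Mobile", "Suzhou", "Software"], counts))
```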
In an alternative, the program 410 is invoked by the processor 402 to cause a computer device to:
applying logarithmic processing to the number of occurrences of each component word.
In an alternative, the program 410 is invoked by the processor 402 to cause a computer device to:
sorting the component words in ascending order of their frequency gains;
removing component-word combinations cumulatively in ascending order of frequency gain and, for each combination removed, calculating the word-frequency sum of the entity formed by the remaining component words, based on the number of occurrences of each component word;
calculating the frequency gain of each component-word combination from these word-frequency sums based on a greedy algorithm;
and constructing the target entity abbreviation based on the component-word combinations whose frequency gain is smaller than a preset frequency-gain threshold.
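One possible reading of the greedy construction step, with invented counts and threshold: eliminate component words cumulatively in ascending order of gain while the cumulative gain of the eliminated combination stays below the threshold, and keep the surviving words, in their original order, as the abbreviation:

```python
import math

def abbreviate(words: list[str], counts: dict[str, int],
               gain_threshold: float) -> list[str]:
    """Greedy sketch: drop low-gain words (gain taken as the word's log
    count, per the leave-one-out reading) while the cumulative gain of
    the dropped combination stays below gain_threshold."""
    gains = {w: math.log(counts[w]) for w in words}
    dropped, cumulative = set(), 0.0
    for w in sorted(words, key=gains.get):  # ascending frequency gain
        if cumulative + gains[w] >= gain_threshold:
            break
        cumulative += gains[w]
        dropped.add(w)
    # surviving words, kept in their original order
    return [w for w in words if w not in dropped]

counts = {"China Mobile": 120, "Suzhou": 15, "Software": 40}
print(abbreviate(["China Mobile", "Suzhou", "Software"], counts, 3.0))
# → ['China Mobile', 'Software']
```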
In summary, in the computer device provided in the foregoing embodiment, the criticality of a word is quantified by its number of occurrences, and the frequency gain of each word is calculated from those counts, so that the key words of the original entity name are retained according to their frequency gains to generate the entity abbreviation. This enables automatic, simple and rapid generation of entity abbreviations with high accuracy.
An embodiment of the invention provides a computer-readable storage medium storing at least one executable instruction which, when run on a computer device/apparatus, causes the computer device/apparatus to perform the entity abbreviation generation method of any of the foregoing method embodiments.
The executable instructions may in particular cause the computer device/apparatus to:
performing word segmentation on a target entity full name based on a preset corpus to obtain the component words constituting the target entity full name; wherein the corpus comprises at least one word and the number of occurrences of each word;
calculating the frequency gain of each component word according to the number of occurrences of each word and a preset rule;
and determining the target entity abbreviation according to the frequency gain of each component word.
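The corpus-based segmentation of the target entity full name (the first step above) is not pinned to a particular algorithm in the patent; greedy longest-match against the corpus vocabulary is one common choice, sketched here with invented names:

```python
def segment(full_name: str, vocabulary: set[str]) -> list[str]:
    """Greedy longest-match segmentation sketch. At each position, take
    the longest vocabulary word that matches; fall back to one character."""
    words, i = [], 0
    while i < len(full_name):
        if full_name[i].isspace():
            i += 1
            continue
        match = next((full_name[i:j]
                      for j in range(len(full_name), i, -1)
                      if full_name[i:j] in vocabulary),
                     full_name[i])
        words.append(match)
        i += len(match)
    return words

vocab = {"China Mobile", "Suzhou", "Software"}
print(segment("China Mobile Suzhou Software", vocab))
# → ['China Mobile', 'Suzhou', 'Software']
```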
In one alternative, the executable instructions cause a computer device/apparatus to:
combining every two entity full names in a given entity library to obtain at least one pair of entity full name combinations;
traversing each pair of entity full name combinations to obtain the common continuous character subset and the specific continuous character subsets of each pair; the common continuous character subset is a shared character string whose length exceeds a preset length threshold, and each specific continuous character subset is the difference set between one entity full name in the combination and the common continuous character subset;
performing association analysis on each common continuous character subset to obtain the support degree of each common continuous character subset;
establishing a word segmentation library from the common continuous character subsets whose support degree exceeds a preset support threshold, together with the specific continuous character subsets;
performing word segmentation on the entity full names in the entity library based on the word segmentation library, and generating a corpus from the segmentation results; the corpus comprises at least one word and the number of occurrences of each word.
In an optional manner, the performing association analysis on each common continuous character subset to obtain the support degree of each common continuous character subset further includes:
performing association analysis on each common continuous character subset based on the Apriori algorithm to obtain the support degree of each common continuous character subset.
In an optional manner, the performing word segmentation on the entity full names in the entity library based on the word segmentation library and generating a corpus from the segmentation results further includes:
segmenting each entity full name in the entity library based on the word segmentation library to obtain the words constituting each entity full name;
and establishing the corpus from the words constituting the entity full names.
In an optional manner, the calculating the frequency gain of each component word according to the number of occurrences of each word and a preset rule further includes:
determining the number of occurrences of each component word from the number of occurrences of each word;
removing the component words one by one and, for each removal, calculating the word-frequency sum of the entity formed by the remaining component words, based on the number of occurrences of each component word;
and calculating the frequency gain of each component word from these word-frequency sums based on a greedy algorithm.
In an optional manner, the executable instructions cause the computer device/apparatus to:
applying logarithmic processing to the number of occurrences of each component word.
In an optional manner, the executable instructions cause the computer device/apparatus to:
sorting the component words in ascending order of their frequency gains;
removing component-word combinations cumulatively in ascending order of frequency gain and, for each combination removed, calculating the word-frequency sum of the entity formed by the remaining component words, based on the number of occurrences of each component word;
calculating the frequency gain of each component-word combination from these word-frequency sums based on a greedy algorithm;
and constructing the target entity abbreviation based on the component-word combinations whose frequency gain is smaller than a preset frequency-gain threshold.
In summary, in the computer-readable storage medium provided in the foregoing embodiment, the criticality of a word is quantified by its number of occurrences, and the frequency gain of each word is calculated from those counts, so that the key words of the original entity name are retained according to their frequency gains to generate the entity abbreviation. This enables automatic, simple and rapid generation of entity abbreviations with high accuracy.
An embodiment of the invention provides an entity abbreviation generation apparatus for performing the entity abbreviation generation method described above.
An embodiment of the invention provides a computer program which can be invoked by a processor to cause a computer device to perform the entity abbreviation generation method of any of the foregoing method embodiments.
An embodiment of the invention provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions which, when executed on a computer, cause the computer to perform the entity abbreviation generation method of any of the foregoing method embodiments.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such a system is apparent from the description above. Moreover, embodiments of the present invention are not directed to any particular programming language; it will be appreciated that the teachings described herein may be implemented in a variety of programming languages, and the above references to specific languages are provided to disclose the enablement and best mode of the invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (10)

1. An entity abbreviation generation method, characterized in that the method comprises:
performing word segmentation on a target entity full name based on a preset corpus to obtain the component words constituting the target entity full name; wherein the corpus comprises at least one word and the number of occurrences of each word;
calculating the frequency gain of each component word according to the number of occurrences of each word and a preset rule;
and determining the target entity abbreviation according to the frequency gain of each component word.
2. The method according to claim 1, wherein before the word segmentation is performed on the target entity full name based on the preset corpus to obtain the component words constituting the target entity full name, the method further comprises:
combining every two entity full names in a given entity library to obtain at least one pair of entity full name combinations;
traversing each pair of entity full name combinations to obtain the common continuous character subset and the specific continuous character subsets of each pair; wherein the common continuous character subset is a shared character string whose length exceeds a preset length threshold, and each specific continuous character subset is the difference set between one entity full name in the combination and the common continuous character subset;
performing association analysis on each common continuous character subset to obtain the support degree of each common continuous character subset;
establishing a word segmentation library from the common continuous character subsets whose support degree exceeds a preset support threshold, together with the specific continuous character subsets;
and performing word segmentation on the entity full names in the entity library based on the word segmentation library, and generating a corpus from the segmentation results; wherein the corpus comprises at least one word and the number of occurrences of each word.
3. The method of claim 2, wherein the performing association analysis on each common continuous character subset to obtain the support degree of each common continuous character subset further comprises:
performing association analysis on each common continuous character subset based on the Apriori algorithm to obtain the support degree of each common continuous character subset.
4. The method of claim 2, wherein the performing word segmentation on the entity full names in the entity library based on the word segmentation library and generating a corpus from the segmentation results further comprises:
segmenting each entity full name in the entity library based on the word segmentation library to obtain the words constituting each entity full name;
and establishing the corpus from the words constituting the entity full names.
5. The method of claim 1, wherein the calculating the frequency gain of each component word according to the number of occurrences of each word and a preset rule further comprises:
determining the number of occurrences of each component word from the number of occurrences of each word;
removing the component words one by one and, for each removal, calculating the word-frequency sum of the entity formed by the remaining component words, based on the number of occurrences of each component word;
and calculating the frequency gain of each component word from these word-frequency sums based on a greedy algorithm.
6. The method of claim 5, wherein after the determining the number of occurrences of each component word from the number of occurrences of each word, the method further comprises:
applying logarithmic processing to the number of occurrences of each component word.
7. The method according to any one of claims 1-6, wherein the determining the target entity abbreviation according to the frequency gain of each component word further comprises:
sorting the component words in ascending order of their frequency gains;
removing component-word combinations cumulatively in ascending order of frequency gain and, for each combination removed, calculating the word-frequency sum of the entity formed by the remaining component words, based on the number of occurrences of each component word;
calculating the frequency gain of each component-word combination from these word-frequency sums based on a greedy algorithm;
and constructing the target entity abbreviation based on the component-word combinations whose frequency gain is smaller than a preset frequency-gain threshold.
8. An entity abbreviation generation apparatus, characterized in that the apparatus comprises:
a word segmentation module, configured to perform word segmentation on a target entity full name based on a preset corpus to obtain the component words constituting the target entity full name; wherein the corpus comprises at least one word and the number of occurrences of each word;
a calculation module, configured to calculate the frequency gain of each component word according to the number of occurrences of each word and a preset rule;
and a processing module, configured to determine the target entity abbreviation according to the frequency gain of each component word.
9. A computer device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the entity abbreviation generation method of any of claims 1-7.
10. A computer-readable storage medium, wherein at least one executable instruction is stored in the storage medium which, when executed on a computer device/apparatus, causes the computer device/apparatus to perform the operations of the entity abbreviation generation method of any of claims 1-7.
CN202111314465.0A 2021-11-08 2021-11-08 Entity abbreviation generation method, device, computer equipment and storage medium Pending CN116090448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111314465.0A CN116090448A (en) 2021-11-08 2021-11-08 Entity abbreviation generation method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116090448A true CN116090448A (en) 2023-05-09

Family

ID=86201240


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination