CN111695340B - Method and device for extracting short names - Google Patents
Method and device for extracting short names
- Publication number
- CN111695340B (application CN202010545742.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- candidate
- sentence
- index
- texts
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a device for extracting short names (abbreviations). The method comprises the following steps: acquiring a plurality of texts containing a full name; determining relevance indexes between the full name and the plurality of texts according to the distribution of the full name in the plurality of texts; determining a plurality of candidate texts from the plurality of texts according to the relevance indexes; for any candidate text among the candidate texts, extracting the candidate abbreviations contained in the candidate text according to the body of the candidate text and a first preset sentence structure, and/or according to the longest common subsequence of the title of the candidate text and the full name; and taking those candidate abbreviations, among the candidate abbreviations contained in the candidate texts, that pass a validity check as the abbreviations of the full name. When the method is applied in financial technology (Fintech), the abbreviation of the full name is obtained through multi-layer screening, which gives higher accuracy than a single extraction rule.
Description
Technical Field
The invention relates to the field of artificial intelligence within financial technology (Fintech), and in particular to a method and device for extracting short names (abbreviations).
Background
With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). However, because of the financial industry's requirements on security and real-time performance, higher demands are also placed on these technologies. In the current financial industry, public opinion monitoring of entities has become an important part of entity management. For example, when the entity is an enterprise, its risk status can be quickly tracked through public opinion data about the enterprise, such as fluctuations in the enterprise's situation, whether the enterprise is exposed to serious liabilities, whether there are illegal acts in its operation, whether the enterprise is involved in lawsuits, and the like. However, since public opinion reports (e.g. news reports) usually refer to the corresponding entities by their short names, and the short names may not be recorded in the entity registration information, the short names of entities need to be extracted.
A common method for extracting an abbreviation is "cutting the head and removing the tail": the full entity name is divided into three parts, a regional part, a middle part and a suffix part. For example, for "L region X limited", the three parts are "L region", "X" and "limited" respectively; after the regional part and the suffix part are removed, "X" is obtained as the abbreviation of the enterprise. Obviously, short names take many and varied forms, so such a single extraction rule yields low accuracy, which is a problem to be solved urgently.
Disclosure of Invention
The invention provides a method and a device for extracting abbreviations, which solve the problem of low abbreviation-extraction accuracy in the prior art.
In a first aspect, the present invention provides a method for extracting abbreviations, including: acquiring a plurality of texts containing a full name, wherein each text comprises a title and/or a body; determining relevance indexes between the full name and the plurality of texts according to the distribution of the full name in the plurality of texts; determining a plurality of candidate texts from the plurality of texts according to the relevance indexes; for any candidate text among the candidate texts, extracting the candidate abbreviations contained in the candidate text according to the body of the candidate text and a first preset sentence structure, and/or according to the longest common subsequence of the title of the candidate text and the full name; and taking those candidate abbreviations, among the candidate abbreviations contained in the candidate texts, that pass a validity check as abbreviations of the full name.
In this method, after multiple texts containing the full name are obtained, multiple candidate texts are screened from them according to the distribution of the full name in the texts and the resulting relevance indexes; then, for any candidate text, the candidate short names contained in it are extracted according to the body of the candidate text and the first preset sentence structure and/or according to the longest common subsequence of the title of the candidate text and the full name; finally, the candidate short names that pass the validity check are selected as the short names of the full name. Compared with a single extraction rule, this multi-layer screening yields higher accuracy.
Optionally, determining the relevance indexes between the full name and the plurality of texts according to the distribution of the full name in the plurality of texts comprises: for any text among the plurality of texts, taking at least one of the following as distribution indexes of the full name in the body of the text: the total number of times the full name appears in the body of the text, as a first index; the positions of the paragraphs in which the full name appears and the number of times the full name appears in each such paragraph, as a second index; the positions of the sentences in which the full name appears and the sentence structures of those sentences, as a third index; the number of sentences between the first and last occurrences of the full name in the body of the text, as a fourth index; and the number of sentences containing the full name between the first and last occurrences of the full name in the body of the text, as a fifth index; for any text among the plurality of texts, determining a weight value for each distribution index according to that distribution index in the body of the text; and determining the relevance index between the full name and the body of the text according to the weight values of the distribution indexes.
In this manner, for any text among the plurality of texts, a weight value is determined for each distribution index of the full name in the body of the text, where the distribution indexes comprise at least one of the first to fifth indexes, each representing one aspect of relevance of the full name; the relevance index between the full name and the body of the text is then determined from these weight values, so that the distribution indexes are combined and the relevance between the full name and the body of the text is described more accurately.
Optionally, the weight value of the second index is determined as follows: for each paragraph in which the full name appears, when the paragraph is the first or last paragraph, its first sub-weight value is set to a first preset value, otherwise to a second preset value, the first preset value being greater than the second preset value; a second sub-weight value of the paragraph is set according to the number of times the full name appears in the paragraph and a preset increasing function; the weight value of each paragraph is determined from its first and second sub-weight values; and the weight value of the second index is determined from the weight values of the paragraphs.
In this manner, since the first and last paragraphs of a text generally best represent its subject matter, the first sub-weight value of a first or last paragraph is set to the first preset value, which is greater than the second preset value; combined with the paragraph's second sub-weight value, this yields the paragraph's weight value, and in turn the weight value of each paragraph, so that combining these paragraph weight values characterizes the paragraph-level relevance to the full name more accurately.
Optionally, the weight value of the third index is determined as follows: for each sentence in which the full name appears, when the sentence is the first or last sentence of its paragraph, its first sub-weight value is set to a third preset value, otherwise to a fourth preset value, the third preset value being greater than the fourth preset value; when the sentence structure of the sentence is a second preset sentence structure, its second sub-weight value is set to a fifth preset value, otherwise to a sixth preset value, the fifth preset value being smaller than the sixth preset value; the weight value of each sentence is determined from its first and second sub-weight values; and the weight value of the third index is determined from the weight values of the sentences.
In this manner, since the first and last sentences of a paragraph generally best represent its meaning, the first sub-weight value of a sentence that is the first or last sentence of its paragraph is set to the third preset value, which is greater than the fourth preset value; the sentence structure of the sentence additionally determines its second sub-weight value, from which the sentence's weight value and in turn the weight value of each sentence are obtained, so that combining these sentence weight values characterizes the sentence-level relevance to the full name more accurately.
Optionally, determining the relevance index between the full name and the body of the text according to the weight values of the distribution indexes includes determining the relevance index according to the following formula:
S = (X1 + Σ_{i=1}^{M} X2i + Σ_{j=1}^{N} X3j) × X4 × X5
wherein X1 is the weight value of the first index; X2i is the paragraph weight value of the second index for the i-th paragraph in which the full name appears, M being the number of paragraphs of the text; X3j is the sentence weight value of the third index for the j-th sentence in which the full name appears, N being the number of sentences of the text; X4 is the weight value of the fourth index; X5 is the weight value of the fifth index; and S is the relevance index between the full name and the body of the text.
In this manner, X1 considers the relevance of the full name to the body of the text at word granularity, X2i at paragraph granularity, and X3j at sentence granularity; jointly considering these three fine granularities yields the macroscopic relevance of the full name across the whole text, while X4 and X5 consider the relevance of the full name to the body of the text from a full-text perspective, so that a more accurate relevance index is obtained.
Optionally, extracting the candidate abbreviations contained in the candidate text according to the longest common subsequence of the title of the candidate text and the full name comprises: if the longest common subsequence of the title of the candidate text and the full name is a substring of the title of the candidate text, taking the longest common subsequence as a candidate substring; and if the frequency of the candidate substring in the titles of the plurality of candidate texts is greater than a preset frequency threshold, taking the candidate substring as a candidate abbreviation contained in the candidate text.
In this manner, since an abbreviation is not necessarily a contiguous substring of the full name but is necessarily a contiguous substring of a title that uses it, the longest common subsequence of the title of the candidate text and the full name is taken as a candidate substring only when it is a substring of that title; candidate abbreviations are then determined from the candidate substrings according to their frequency in the titles of the plurality of candidate texts, so as to keep the more conventional abbreviations and exclude erroneous and uncommon ones.
Optionally, taking the candidate abbreviations that pass the validity check, among the candidate abbreviations contained in the candidate texts, as abbreviations of the full name comprises: for any one of the candidate abbreviations, inputting the candidate abbreviation into an entity-word validity judging model, and if the output of the model indicates that the candidate abbreviation is valid, taking it as an abbreviation of the full name; the entity-word validity judging model is obtained through machine learning training on a preset sample set, in which each positive sample is a title together with an entity word contained in that title, and each negative sample is a title together with a non-entity word contained in that title.
In this manner, the entity-word validity judging model is obtained by machine learning training on the preset sample set, learning from both positive and negative samples, so that after a candidate abbreviation is input into the model, its validity can be checked according to the knowledge the model has learned.
Optionally, taking the candidate abbreviations that pass the validity check, among the candidate abbreviations contained in the candidate texts, as abbreviations of the full name comprises: taking those candidate abbreviations whose word structure satisfies the word-construction rules of the full name's abbreviation as abbreviations of the full name; the word-construction rules of the abbreviation are determined according to the word structure of the full name.
In this manner, after the word-construction rules of the abbreviation are determined from the word structure of the full name, the validity of each candidate abbreviation can be determined by checking whether its word structure satisfies those rules.
In a second aspect, the present invention provides an apparatus for extracting abbreviations, including: an acquisition module, configured to acquire a plurality of texts containing a full name, wherein each text comprises a title and/or a body; and a processing module, configured to determine relevance indexes between the full name and the plurality of texts according to the distribution of the full name in the plurality of texts; determine a plurality of candidate texts from the plurality of texts according to the relevance indexes; for any candidate text among the candidate texts, extract the candidate abbreviations contained in the candidate text according to the body of the candidate text and a first preset sentence structure, and/or according to the longest common subsequence of the title of the candidate text and the full name; and take those candidate abbreviations, among the candidate abbreviations contained in the candidate texts, that pass a validity check as abbreviations of the full name.
Optionally, the processing module is specifically configured to: for any text among the plurality of texts, take at least one of the following as distribution indexes of the full name in the body of the text: the total number of times the full name appears in the body of the text, as a first index; the positions of the paragraphs in which the full name appears and the number of times the full name appears in each such paragraph, as a second index; the positions of the sentences in which the full name appears and the sentence structures of those sentences, as a third index; the number of sentences between the first and last occurrences of the full name in the body of the text, as a fourth index; and the number of sentences containing the full name between the first and last occurrences of the full name in the body of the text, as a fifth index; for any text among the plurality of texts, determine a weight value for each distribution index according to that distribution index in the body of the text; and determine the relevance index between the full name and the body of the text according to the weight values of the distribution indexes.
Optionally, the processing module is specifically configured to determine the weight value of the second index as follows: for each paragraph in which the full name appears, when the paragraph is the first or last paragraph, set its first sub-weight value to a first preset value, otherwise to a second preset value, the first preset value being greater than the second preset value; set a second sub-weight value of the paragraph according to the number of times the full name appears in the paragraph and a preset increasing function; determine the weight value of each paragraph from its first and second sub-weight values; and determine the weight value of the second index from the weight values of the paragraphs.
Optionally, the processing module is specifically configured to determine the weight value of the third index as follows: for each sentence in which the full name appears, when the sentence is the first or last sentence of its paragraph, set its first sub-weight value to a third preset value, otherwise to a fourth preset value, the third preset value being greater than the fourth preset value; when the sentence structure of the sentence is a second preset sentence structure, set its second sub-weight value to a fifth preset value, otherwise to a sixth preset value, the fifth preset value being smaller than the sixth preset value; determine the weight value of each sentence from its first and second sub-weight values; and determine the weight value of the third index from the weight values of the sentences.
Optionally, the processing module is specifically configured to determine the relevance index between the full name and the body of the text according to the following formula:
S = (X1 + Σ_{i=1}^{M} X2i + Σ_{j=1}^{N} X3j) × X4 × X5
wherein X1 is the weight value of the first index; X2i is the paragraph weight value of the second index for the i-th paragraph in which the full name appears, M being the number of paragraphs of the text; X3j is the sentence weight value of the third index for the j-th sentence in which the full name appears, N being the number of sentences of the text; X4 is the weight value of the fourth index; X5 is the weight value of the fifth index; and S is the relevance index between the full name and the body of the text.
Optionally, the processing module is specifically configured to: if the longest common subsequence of the title of the candidate text and the full name is a substring of the title of the candidate text, take the longest common subsequence as a candidate substring; and if the frequency of the candidate substring in the titles of the plurality of candidate texts is greater than a preset frequency threshold, take the candidate substring as a candidate abbreviation contained in the candidate text.
Optionally, the processing module is specifically configured to: for any one of the candidate abbreviations, input the candidate abbreviation into an entity-word validity judging model, and if the output of the model indicates that the candidate abbreviation is valid, take it as an abbreviation of the full name; the entity-word validity judging model is obtained through machine learning training on a preset sample set, in which each positive sample is a title together with an entity word contained in that title, and each negative sample is a title together with a non-entity word contained in that title.
Optionally, the processing module is specifically configured to: take those candidate abbreviations whose word structure satisfies the word-construction rules of the full name's abbreviation as abbreviations of the full name; the word-construction rules of the abbreviation are determined according to the word structure of the full name.
The advantageous effects of the second aspect and the various optional apparatuses of the second aspect may refer to the advantageous effects of the first aspect and the various optional methods of the first aspect, and are not described herein again.
In a third aspect, the present invention provides a computer device comprising a program or instructions for performing the method of the first aspect and the alternatives of the first aspect when the program or instructions are executed.
In a fourth aspect, the present invention provides a storage medium comprising a program or instructions which, when executed, is adapted to perform the method of the first aspect and the alternatives of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required in the description of the embodiments are briefly described below.
Fig. 1 is a schematic flow chart of the steps of a method for extracting abbreviations according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an apparatus for extracting abbreviations according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions, the technical solutions will be described in detail below with reference to the drawings and the specific embodiments of the specification, and it should be understood that the specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application, but not limitations of the technical solutions of the present application, and the technical features in the embodiments and examples of the present application may be combined with each other without conflict.
The definitions of the terms appearing in the present application are listed first below.
In the business operation of financial institutions (banking, insurance or securities institutions), such as bank loan and deposit services, public opinion monitoring of entities has become an important part of entity management, and accurately extracting an entity's short name is important for monitoring that entity's public opinion. However, the current "head-and-tail removal" extraction method is a single rule with low accuracy. This situation does not meet the requirements of financial institutions such as banks, and cannot ensure the efficient operation of their various services.
For this reason, as shown in Fig. 1, the present application provides a method for extracting abbreviations, comprising the following steps.
Step 101: a plurality of texts containing the full name are obtained.
Wherein each text includes a title and/or a body.
Step 102: relevance indexes between the full name and the plurality of texts are determined according to the distribution of the full name in the plurality of texts, and a plurality of candidate texts are determined from the plurality of texts according to these relevance indexes.
Step 103: for any candidate text among the candidate texts, the candidate abbreviations contained in the candidate text are extracted according to the body of the candidate text and a first preset sentence structure, and/or according to the longest common subsequence of the title of the candidate text and the full name.
Step 104: the candidate abbreviations, among those contained in the candidate texts, that pass the validity check are taken as abbreviations of the full name.
The overall idea of steps 101 to 104 is to find a plurality of texts containing the full name of the entity to be processed, and then gradually identify and extract the abbreviation of that entity from these texts through multi-layer screening based on text relevance indexes, sentence structures, common subsequences and the like.
The process of step 101 may specifically be as follows:
the method comprises the steps of collecting full-name texts related to entities to be processed on the Internet, wherein the texts can comprise texts of news web media, forums, blogs, microblogs and various information clients, and then uniformly storing the texts in a search engine for supporting subsequent retrieval and query. For example, the Search engine may be ES (Elastic Search): ES is a distributed, scalable real-time search and data analysis engine. The more the returned search results are ranked, the higher the text similarity representing the results and the search language.
The full name of the entity to be processed is used as the search term, and a certain number of texts are retrieved from the ES search engine under the condition that the body or title field must contain the complete full name of the entity to be processed. For example, the number of returned texts may be limited to 10,000. It should be noted that the texts returned by the search engine are not necessarily the texts most relevant to the full name.
For example, a native ES search engine does not necessarily return the texts that are semantically most relevant to the search term; instead, it segments the search term into words and ranks higher the texts whose bodies contain more of those words. For example, if the search term is "China Aeroengine Group Limited Company" and word segmentation yields "China", "aviation", "engine", "group", "limited", "company", and the complete search term appears 5 times in the body of text A, while in text B the complete search term appears only once but "company" appears 20 times, then text B may be ranked earlier than text A in the ES results. If the body of text B is something like "… bought breakfast at a convenience store beside China Aeroengine Group Limited Company …", such a text may harm the accuracy of the subsequent abbreviation extraction and therefore needs to be filtered out of the plurality of texts.
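As an illustrative, non-limiting sketch of the retrieval in step 101, the following Python snippet assumes the elasticsearch client library and hypothetical index and field names ("news", "title", "body"); the exact call signature depends on the client version, and a match_phrase query only approximates the requirement that the complete full name appear:
    from elasticsearch import Elasticsearch

    def retrieve_texts(full_name, es_host="http://localhost:9200", max_hits=10000):
        # Retrieve up to max_hits texts whose "title" or "body" field contains the full name.
        es = Elasticsearch(es_host)
        query = {
            "bool": {
                "should": [
                    {"match_phrase": {"title": full_name}},
                    {"match_phrase": {"body": full_name}},
                ],
                "minimum_should_match": 1,
            }
        }
        resp = es.search(index="news", query=query, size=max_hits)
        # The returned ranking is not necessarily by true relevance, so step 102 re-screens it.
        return [hit["_source"] for hit in resp["hits"]["hits"]]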
Then, step 102 is executed to screen out a plurality of candidate texts. For example, the 1,000 candidate texts whose bodies are most relevant to the search term are selected from the 10,000 news texts.
It should be noted that step 102 may consider only the body of a text, or both the body and the title, in which case the title is counted as the first paragraph and also as the first sentence; the specific implementation may be set flexibly according to specific requirements.
In an alternative embodiment of step 102, "determining the relevance indexes between the full name and the plurality of texts according to the distribution of the full name in the plurality of texts" may be performed in the following manner (hereinafter referred to as the "embodiment of determining the relevance index based on the distribution indexes"):
For any text among the plurality of texts, at least one of the following is taken as a distribution index of the full name in the body of the text:
The total number of times the full name appears in the body of the text is taken as the first index. The first index may also be called the term frequency X1 of the search term.
The more times the full name appears in the text, the more likely the text is written around the full name, and therefore the higher the relevance.
The positions of the paragraphs in which the full name appears, and the number of times the full name appears in each of those paragraphs, are taken as the second index. The second index may also be referred to as the paragraph weight X2.
By the general conventions of writing, different paragraph positions carry different degrees of importance; for example, the first and last paragraphs often highlight the subject matter of the text. The more important the positions of the paragraphs in which the full name appears, and the more times the full name appears in those important paragraphs, the more relevant the subject matter of the text is to the full name.
The positions of the sentences in which the full name appears, and the sentence structures of those sentences, are taken as the third index. The third index may also be referred to as the sentence weight X3.
By the general conventions of writing, sentences at different positions within a paragraph carry different degrees of importance; for example, the first and last sentences often state the subject matter of the paragraph. The more important the positions of the sentences in which the full name appears, and the more times the full name appears in those important sentences, the more relevant the subject matter of the text is to the full name.
The number of sentences between the first and last occurrences of the full name in the body of the text is taken as the fourth index. The fourth index may also be referred to as the search term span X4.
By the general conventions of writing (a text often follows a general-specific-general structure), the number of sentences between the first and last occurrences of the full name in the body indicates the range of sentences the full name spans; obviously, the larger the range spanned, the more relevant the text is to the full name.
The number of sentences containing the full name between the first and last occurrences of the full name in the body of the text is taken as the fifth index. The fifth index may also be referred to as the search term density X5.
By the general conventions of writing, the core word of a piece of text is mentioned frequently. The number of sentences containing the full name between its first and last occurrences in the body characterizes how frequently the full name is mentioned; obviously, the more frequently the full name is mentioned in the sentences of a text, the more relevant the text is to the full name.
For any text among the plurality of texts, a weight value is determined for each distribution index according to that distribution index in the body of the text, and the relevance index between the full name and the body of the text is determined according to the weight values of the distribution indexes.
It should be noted that there may be various ways to determine the relevance indexes between the full name and the texts according to the distribution of the full name in the texts; for example, the relevance indexes may be determined from the Euclidean distances between a sentence vector of the full name and sentence vectors of the texts, from the average number of words between successive occurrences of the full name in the texts, from the proportion of the words of the full name among the words of the texts, and so on. The weight values of the first to fifth indexes may be determined specifically as follows:
(1) More specifically, in the embodiment where the relevance index is determined based on the distribution indexes, the weight value of the first index may be determined as follows:
X1 itself is taken as the weight value of the first index. Other flexible arrangements are also possible, such as substituting X1 into a weight-value evaluation function of the first index, e.g. X1 + 1 or 2X1, and the like.
(2) More specifically, in the embodiment where the relevance index is determined based on the distribution indexes, the weight value of the second index may be determined as follows:
for each paragraph of the full name, when the paragraph position of the paragraph is a head paragraph or a tail paragraph, setting a first sub weight value of the paragraph to a first preset value, otherwise, setting the first sub weight value of the paragraph to a second preset value, wherein the first preset value is greater than the second preset value. In specific implementation, the weight can be set by setting a front segment or a rear segment or other specific segments (penultimate segments).
And when the number of times of the full name appearing in the paragraph is more than or equal to 1, setting a second sub-weight value of the paragraph according to the number of times of the full name appearing in the paragraph and a preset increasing functional relation, otherwise, setting the second sub-weight value of the paragraph to be 0. In a specific implementation, the second sub-weight value of the paragraph may also be flexibly set according to other scenarios, for example, set to 0.0001.
Determining the weight value of each paragraph according to the first sub weight value and the second sub weight value of each paragraph; and determining the weight value of the second index according to the weight value of each paragraph.
For example, each paragraph is assigned a first sub-weight value a (also referred to as the paragraph position weight): the position weight of the first and last paragraphs is 2.5, and the position weight of the remaining paragraphs is 1. When the second sub-weight value b of each paragraph (which may also be referred to as the search-term density weight) is calculated, if the search term appears n times in the paragraph, then b = 0 for n = 0, and for n ≥ 1 the density weight follows the preset increasing function b = s^(n-1), for example b = 1.5^(n-1). (This ensures that the relevance score of one paragraph containing the search term twice is higher than that of two paragraphs each containing the search term once, i.e. the higher the density of the search term, the higher the relevance score.) Finally, the weight value of the paragraph is calculated as X2 = a × b.
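A minimal sketch of this paragraph-weight calculation, assuming the example values above (position weight 2.5 for the first and last paragraphs, 1 otherwise, and b = 1.5^(n-1)):
    def paragraph_weight(paragraph_index, num_paragraphs, full_name, paragraph_text, s=1.5):
        # First sub-weight a: paragraph position weight.
        a = 2.5 if paragraph_index in (0, num_paragraphs - 1) else 1.0
        # Second sub-weight b: search-term density weight, b = s^(n-1) for n >= 1, else 0.
        n = paragraph_text.count(full_name)
        b = 0.0 if n == 0 else s ** (n - 1)
        return a * b  # weight value X2 of this paragraph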
(3) More specifically, in the embodiment where the relevance index is determined based on the distribution indexes, the weight value of the third index may be determined as follows:
For each sentence in which the full name appears, when the sentence is the first or last sentence of its paragraph, its first sub-weight value is set to a third preset value; otherwise it is set to a fourth preset value, the third preset value being greater than the fourth preset value. In a specific implementation, the weights may also be set for the first few sentences, the last few sentences or other specific sentences (e.g. the penultimate sentence).
When the sentence structure of the sentence is a second preset sentence structure, its second sub-weight value is set to a fifth preset value; otherwise it is set to a sixth preset value, the fifth preset value being smaller than the sixth preset value.
The weight value of each sentence is determined from its first and second sub-weight values, and the weight value of the third index is determined from the weight values of the sentences.
For example, each sentence is assigned a first sub-weight value c (which may also be referred to as the sentence position weight): if a sentence is the first sentence of a paragraph or the last sentence of a paragraph, its position weight is 1, and otherwise 0.7. Each sentence is then assigned a second sub-weight value d (the sentence topic weight), with the second preset sentence structure being, for example: if a sentence contains the search term and does not contain an enumeration comma ("、"), its topic weight is 1; if it contains the search term and also contains an enumeration comma, its topic weight is 0.2 (this lowers the relevance of sentences such as "Enterprise A, Enterprise B, Enterprise C and Enterprise D attended the current meeting"); and if a sentence does not contain the search term, its topic weight is 0. Finally, the weight value of the sentence is calculated as X3 = c × d.
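A corresponding minimal sketch of the sentence-weight calculation, assuming the example values above and using the Chinese enumeration comma "、" as the pause sign:
    def sentence_weight(sentence_index, num_sentences_in_paragraph, full_name, sentence_text):
        # First sub-weight c: sentence position weight.
        c = 1.0 if sentence_index in (0, num_sentences_in_paragraph - 1) else 0.7
        # Second sub-weight d: sentence topic weight based on the second preset sentence structure.
        if full_name not in sentence_text:
            d = 0.0
        elif "、" in sentence_text:  # enumeration comma lowers the topic weight
            d = 0.2
        else:
            d = 1.0
        return c * d  # weight value X3 of this sentence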
(4) More specifically, in the embodiment where the relevance index is determined based on the distribution indexes, the weight value of the fourth index may be determined as follows:
The ratio of the number p of sentences between the first and last occurrences of the search term in the body of the text to the total number q of sentences in the full text is used as the weight value of the fourth index, namely X4 = p / q.
(5) More specifically, in the embodiment where the relevance index is determined based on the distribution indexes, the weight value of the fifth index may be determined as follows:
For example, if, on average, r sentences out of every K sentences between the first and last occurrences of the search term in the body contain the search term, then X5 = r / K; for example, K = 10.
Then, a specific implementation of "determining the relevance index between the full name and the body of the text according to the weight values of the distribution indexes" may be as follows:
The relevance index between the full name and the body of the text is determined according to the following formula:
S = (X1 + Σ_{i=1}^{M} X2i + Σ_{j=1}^{N} X3j) × X4 × X5
wherein X1 is the weight value of the first index; X2i is the paragraph weight value of the second index for the i-th paragraph in which the full name appears, M being the number of paragraphs of the text; X3j is the sentence weight value of the third index for the j-th sentence in which the full name appears, N being the number of sentences of the text; X4 is the weight value of the fourth index; X5 is the weight value of the fifth index; and S is the relevance index between the full name and the body of the text.
Here X1, X2i and X3j consider the relevance of the full name to the body of the text at word, paragraph and sentence granularity respectively; treating these three granularities as equivalent, their contributions are added to obtain the macroscopic relevance of the full name across the whole text. X4 and X5 consider the relevance of the full name to the body of the text from a full-text perspective; they are likewise treated as equivalent and multiply the summed per-granularity term, keeping the full-text factors separate from the per-granularity calculation, so that a more accurate relevance index is obtained.
It should be noted that the above formula is only one possible weight setting and calculation formula; different formulas can be obtained by flexibly combining the search-term frequency, the paragraph weight, the sentence weight, the search-term span and the search-term density according to their relation to the relevance index.
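A minimal sketch of the relevance-score calculation, assuming the formula and example weights above and reusing the paragraph_weight and sentence_weight sketches; the text is represented as a list of paragraphs, each a list of sentence strings, and the density term is simplified to the hit ratio over the span:
    def relevance_score(full_name, paragraphs):
        sentences = [s for p in paragraphs for s in p]
        q = len(sentences)
        # X1: total occurrences of the full name in the body (term frequency).
        x1 = sum(s.count(full_name) for s in sentences)
        # Sums of paragraph weights X2i and sentence weights X3j.
        x2_sum = sum(paragraph_weight(i, len(paragraphs), full_name, "".join(p))
                     for i, p in enumerate(paragraphs))
        x3_sum = sum(sentence_weight(j, len(p), full_name, s)
                     for p in paragraphs for j, s in enumerate(p))
        hit_positions = [k for k, s in enumerate(sentences) if full_name in s]
        if not hit_positions or q == 0:
            return 0.0
        span = hit_positions[-1] - hit_positions[0] + 1
        x4 = span / q                      # X4: span ratio p / q
        x5 = len(hit_positions) / span     # X5: simplified density within the span
        return (x1 + x2_sum + x3_sum) * x4 * x5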
Further, determining the candidate texts from the plurality of texts according to the relevance indexes between the full name and the texts may proceed as follows:
Texts whose relevance indexes are lower than a preset relevance-index threshold and/or ranked lower are filtered out of the plurality of texts obtained in step 101. For example, the top 1,000 texts in the relevance ranking are selected as the candidate texts (if there are fewer than 1,000 texts, all of them are selected). The candidate texts may also be selected in other ways, for example by discarding the highest-ranked and lowest-ranked texts before selecting, or by selecting a top percentage of texts, such as the texts with the top 15% of relevance indexes.
In step 103, a specific implementation of extracting the candidate abbreviations contained in the candidate text according to the body of the candidate text and the first preset sentence structure may be as follows:
Extracting candidate abbreviations from the body of the candidate text: first, all positions of the full name in the body of the candidate text are found; then it is judged whether a preset text range around the full name (for example, a one-sentence range) contains the first preset sentence structure, for example sentence patterns such as '(abbreviated as "Company X")' or '(hereinafter: "Company X")'; if so, the abbreviation at the preset position within that sentence structure is extracted as a candidate abbreviation. It should be noted that, since some texts simply write "the company" or the like in such patterns, for better accuracy these extractions are not directly taken as the final abbreviation, and the subsequent checks are still carried out.
In step 103, an optional implementation of extracting the candidate abbreviations contained in the candidate text according to the longest common subsequence of the title of the candidate text and the full name may be as follows:
If the longest common subsequence of the title of the candidate text and the full name is a substring of the title of the candidate text, the longest common subsequence is taken as a candidate substring; and if the frequency of the candidate substring in the titles of the plurality of candidate texts is greater than a preset frequency threshold, the candidate substring is taken as a candidate abbreviation contained in the candidate text.
Extracting candidate abbreviations from the titles of the texts: an abbreviation is usually a subsequence of the full name but not necessarily a contiguous substring of it. For example, the abbreviation of "Shenzhen Qianhai Weizhong Bank Co., Ltd." is "Weizhong Bank", which is a substring, while the abbreviation of "Industrial and Commercial Bank of China" (中国工商银行) is "工行" ("Gonghang"), which is a subsequence but not a substring.
When calculating the common subsequence, the suffix of the full name, such as "Co., Ltd." or "limited liability company", may first be removed (terms such as "shares", "group", "holding" and "investment" in the full name need to be preserved). Then, English parentheses in the enterprise full name and in the titles of all texts are replaced by Chinese parentheses. With the suffix removed, the longest common subsequence between the full name and the title of each text is calculated; if the resulting subsequence is not a substring of the corresponding title, it is not kept (for example, from a title such as "a certain Shenzhen bank company" and the full name "Shenzhen Qianhai Weizhong Bank Co., Ltd.", a longest common subsequence "Shenzhen bank" may be computed, which is not a contiguous substring of the title and is therefore discarded).
The frequency of each longest common subsequence is then counted, and those whose frequency is lower than 5% of the total number of news texts are deleted (low-frequency longest common subsequences are likely to be wrong); for example, with 1,000 news texts the threshold is 50. The remaining set of frequently occurring longest common subsequences is used as the candidate abbreviations extracted from the candidate texts.
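A minimal sketch of this title-based extraction (a character-level longest common subsequence by dynamic programming, a substring check against the title, then frequency filtering); the suffix list and the 5% threshold follow the example above and are illustrative:
    from collections import Counter

    def longest_common_subsequence(a, b):
        # Standard dynamic-programming LCS over characters.
        m, n = len(a), len(b)
        dp = [[""] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                if a[i] == b[j]:
                    dp[i + 1][j + 1] = dp[i][j] + a[i]
                else:
                    dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
        return dp[m][n]

    def extract_title_candidates(full_name, titles, min_ratio=0.05):
        # Remove the company-type suffix before computing the LCS (illustrative suffix list).
        for suffix in ("股份有限公司", "有限责任公司", "有限公司"):
            full_name = full_name.replace(suffix, "")
        counts = Counter()
        for title in titles:
            lcs = longest_common_subsequence(full_name, title)
            if lcs and lcs in title:  # keep only subsequences that are contiguous in the title
                counts[lcs] += 1
        threshold = min_ratio * len(titles)
        return [cand for cand, freq in counts.items() if freq >= threshold]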
An alternative implementation of step 104 (hereinafter referred to as implementation one of step 104) is as follows:
For any one of the candidate abbreviations, the candidate abbreviation is input into an entity-word validity judging model, and if the output of the model indicates that the candidate abbreviation is valid, it is taken as an abbreviation of the full name.
The entity-word validity judging model is obtained through machine learning training on a preset sample set; each positive sample of the preset sample set is a title together with an entity word contained in that title, and each negative sample is a title together with a non-entity word contained in that title.
It should be noted that an abbreviation necessarily appears in text as an entity word, whereas a longest common subsequence extracted from a full name such as "Shenzhen Qianhai Weizhong Bank Co., Ltd." is not necessarily an entity word; such a word is then not an abbreviation of the enterprise and needs to be deleted from the candidate abbreviation set. Whether a candidate abbreviation is an entity word can be judged through the entity-word validity judging model.
For example, the entity-word validity judging model may be a pre-trained Chinese BERT model, fine-tuned as follows. Training data are prepared: first, 30,000 titles of texts containing entity words are collected; for each title, such as "Weizhong Bank wins the Shenzhen financial innovation prize", an entity word in the title ("Weizhong Bank") is taken as sentence A and the complete title as sentence B, and the sentence pair is labelled "Y" (i.e. sentence A is an entity word in sentence B); then a non-entity word is randomly taken from the title as sentence A, with the complete title again as sentence B, and this sentence pair is labelled "N" (i.e. sentence A is not an entity word in sentence B).
It should be noted that each entity word contained in the title may in turn be used as sentence A and labelled as a sentence pair with sentence B; correspondingly, as many non-entity words as there are entity words in the title are randomly selected and labelled as sentence pairs with sentence B, so that the numbers of "Y" positive samples and "N" negative samples are equal and the sample data are balanced. The entity-word validity judging model is then fine-tuned as a sentence-pair classification task to obtain the final model. With this model, it can be judged whether the candidate abbreviations contained in the candidate texts are entity words; if not, they are deleted, otherwise they are kept.
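A minimal fine-tuning sketch of such a sentence-pair classifier, assuming the Hugging Face transformers and PyTorch libraries and the pre-trained "bert-base-chinese" checkpoint; the two training samples and the single-example training loop are placeholders for the balanced 30,000-title data set described above:
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

    # Each sample is (sentence_a, sentence_b, label): label 1 = "Y" (entity word), 0 = "N".
    train_samples = [
        ("微众银行", "微众银行荣获深圳金融创新奖", 1),
        ("创新奖荣", "微众银行荣获深圳金融创新奖", 0),
    ]

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for epoch in range(3):
        for sent_a, sent_b, label in train_samples:
            enc = tokenizer(sent_a, sent_b, truncation=True, padding=True, return_tensors="pt")
            out = model(**enc, labels=torch.tensor([label]))
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    def is_entity_word(candidate, title):
        # Returns True if the model judges the candidate to be an entity word in the title.
        model.eval()
        with torch.no_grad():
            enc = tokenizer(candidate, title, truncation=True, return_tensors="pt")
            logits = model(**enc).logits
        return int(logits.argmax(dim=-1)) == 1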
Another alternative implementation of step 104 (hereinafter referred to as implementation two of step 104) is as follows:
The candidate abbreviations whose word structure satisfies the word-construction rules of the full name's abbreviation are taken as abbreviations of the full name.
The word-construction rules of the abbreviation are determined according to the word structure of the full name.
For example, the word-construction rules for the abbreviation may be as follows:
If the character length of the candidate abbreviation is greater than 1, it is valid; otherwise it is invalid;
if the full name begins with "China" and the candidate abbreviation begins with "中" ("Zhong"), it is valid; otherwise it is invalid;
if the full name begins with a country name and the candidate abbreviation also begins with that country name, it is valid; otherwise it is invalid;
if the full name contains Chinese numerals and the candidate abbreviation contains the corresponding Chinese numerals, it is valid; otherwise it is invalid; for example, for "China Railway No. 10 Bureau Group Co., Ltd.", a candidate abbreviation that drops the numeral, such as "China Railway Bureau", is invalid;
if the full name still contains province or city names after the leading place-name prefix is removed, and the candidate abbreviation contains the corresponding province or city names, it is valid; otherwise it is invalid; for example, for "China Mobile Communications Group Guangdong Co., Ltd.", the candidate abbreviation "China Mobile" is invalid.
Other word-construction rules can be supplemented flexibly.
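An illustrative sketch of such rule checks; the specific string tests are assumptions modelled on the example rules above, and real rules would also need a place-name list for the province/city check, which is omitted here:
    CHINESE_NUMERALS = "一二三四五六七八九十百千万"

    def passes_construction_rules(full_name, candidate):
        # Rule: the candidate abbreviation must be longer than one character.
        if len(candidate) <= 1:
            return False
        # Rule: if the full name begins with "中", the candidate must also begin with "中".
        if full_name.startswith("中") and not candidate.startswith("中"):
            return False
        # Rule: Chinese numerals appearing in the full name must be kept in the candidate.
        for ch in full_name:
            if ch in CHINESE_NUMERALS and ch not in candidate:
                return False
        return True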
It should be noted that implementation one and implementation two of step 104 may be used individually or in combination. Moreover, step 104 may also be implemented by means of dependency parsing.
As shown in Fig. 2, the present invention provides an apparatus for extracting abbreviations, comprising: an obtaining module 201, configured to obtain a plurality of texts containing a full name, wherein each text comprises a title and/or a body; and a processing module 202, configured to determine relevance indexes between the full name and the plurality of texts according to the distribution of the full name in the plurality of texts; determine a plurality of candidate texts from the plurality of texts according to the relevance indexes; for any candidate text among the candidate texts, extract the candidate abbreviations contained in the candidate text according to the body of the candidate text and a first preset sentence structure, and/or according to the longest common subsequence of the title of the candidate text and the full name; and take those candidate abbreviations, among the candidate abbreviations contained in the candidate texts, that pass a validity check as abbreviations of the full name.
Optionally, the processing module 202 is specifically configured to: for any text among the plurality of texts, take at least one of the following as distribution indexes of the full name in the body of the text: the total number of times the full name appears in the body of the text, as a first index; the positions of the paragraphs in which the full name appears and the number of times the full name appears in each such paragraph, as a second index; the positions of the sentences in which the full name appears and the sentence structures of those sentences, as a third index; the number of sentences between the first and last occurrences of the full name in the body of the text, as a fourth index; and the number of sentences containing the full name between the first and last occurrences of the full name in the body of the text, as a fifth index; for any text among the plurality of texts, determine a weight value for each distribution index according to that distribution index in the body of the text; and determine the relevance index between the full name and the body of the text according to the weight values of the distribution indexes.
Optionally, the processing module 202 is specifically configured to determine the weight value of the second index as follows: for each paragraph in which the full name appears, when the paragraph is the first or last paragraph, set its first sub-weight value to a first preset value, otherwise to a second preset value, the first preset value being greater than the second preset value; set a second sub-weight value of the paragraph according to the number of times the full name appears in the paragraph and a preset increasing function; determine the weight value of each paragraph from its first and second sub-weight values; and determine the weight value of the second index from the weight values of the paragraphs.
Optionally, the processing module 202 is specifically configured to determine the weight value of the third index as follows: for each sentence in which the full name appears, when the sentence is the first or last sentence of its paragraph, a first sub-weight value of the sentence is set to a third preset value, and otherwise to a fourth preset value, the third preset value being greater than the fourth preset value; when the sentence structure of the sentence is a second preset sentence structure, a second sub-weight value of the sentence is set to a fifth preset value, and otherwise to a sixth preset value, the fifth preset value being smaller than the sixth preset value; the weight value of each sentence is determined according to its first and second sub-weight values; and the weight value of the third index is determined according to the weight values of the sentences.
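Analogously, a sketch of the sentence weight for the third index; again the preset values and the way the two sub-weights are combined are assumptions for illustration:

```python
def sentence_weight(position_in_paragraph, sentences_in_paragraph,
                    matches_second_preset_structure,
                    third_preset=1.0, fourth_preset=0.5,
                    fifth_preset=0.2, sixth_preset=0.8):
    """Weight of one sentence for the third index (illustrative values only)."""
    # First sub-weight: the first or last sentence of its paragraph weighs more.
    is_edge = position_in_paragraph in (0, sentences_in_paragraph - 1)
    sub_weight_1 = third_preset if is_edge else fourth_preset

    # Second sub-weight: a sentence matching the second preset sentence structure
    # receives the smaller fifth preset value, otherwise the sixth preset value.
    sub_weight_2 = fifth_preset if matches_second_preset_structure else sixth_preset

    # Multiplying the sub-weights is again an assumption.
    return sub_weight_1 * sub_weight_2
```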
Optionally, the processing module 202 is specifically configured to determine the relevance index of the full name and the text body of the text according to the following formula, wherein X1 is the weight value of the first index; X2i is the paragraph weight value of the second index for the i-th paragraph in which the full name appears, M being the number of paragraphs of the text; X3j is the sentence weight value of the third index for the j-th sentence in which the full name appears, N being the number of sentences of the text; X4 is the weight value of the fourth index; X5 is the weight value of the fifth index; and S is the relevance index of the full name and the text body of the text.
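The formula itself is rendered as a figure in the source and is not reproduced in the text. Given the definitions above, one plausible reading, assuming the paragraph and sentence weights are averaged over M and N and the five terms are simply summed, is:

```latex
S = X_1 + \frac{1}{M}\sum_{i=1}^{M} X_{2i} + \frac{1}{N}\sum_{j=1}^{N} X_{3j} + X_4 + X_5
```

Other combinations, for example a weighted sum with per-index coefficients, would be equally consistent with the variable definitions given here.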
Optionally, the processing module 202 is specifically configured to: if the longest common subsequence of the title of the candidate text and the full name is a substring of the title of the candidate text, take the longest common subsequence as a candidate substring; and if the frequency of the candidate substring in the titles of the plurality of candidate texts is greater than a preset frequency threshold, take the candidate substring as a candidate abbreviation contained in the candidate text.
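A sketch of this title-based extraction, using a standard dynamic-programming longest common subsequence; the frequency threshold of 2 is a placeholder, not a value given in the description:

```python
def longest_common_subsequence(a, b):
    """Standard dynamic-programming longest common subsequence over characters."""
    m, n = len(a), len(b)
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + a[i - 1]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], key=len)
    return dp[m][n]

def candidate_from_title(full_name, title, candidate_titles, frequency_threshold=2):
    """Extract a candidate abbreviation from one candidate title (sketch)."""
    lcs = longest_common_subsequence(title, full_name)
    # The subsequence only qualifies if it is a contiguous substring of the title.
    if lcs and lcs in title:
        # It must also occur often enough across the titles of the candidate texts.
        frequency = sum(1 for t in candidate_titles if lcs in t)
        if frequency > frequency_threshold:
            return lcs
    return None
```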
Optionally, the processing module 202 is specifically configured to: for any candidate abbreviation among the plurality of candidate abbreviations, input the candidate abbreviation into an entity word validity discrimination model, and if the output of the model indicates that the candidate abbreviation is a legal entity word, take the candidate abbreviation as an abbreviation of the full name. The entity word validity discrimination model is obtained by machine learning training on a preset sample set; each positive sample of the preset sample set is a title together with an entity word contained in the title, and each negative sample is a title together with a non-entity word contained in the title.
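A minimal sketch of how such an entity word validity discrimination model could be trained from the described positive and negative samples; the character n-gram features, the logistic regression classifier, and the scikit-learn dependency are assumptions for illustration, not the model used in the embodiment:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_validity_model(positive_samples, negative_samples):
    """Train a binary entity-word validity classifier from (title, word) pairs (sketch)."""
    pairs = positive_samples + negative_samples
    texts = [f"{title} ||| {word}" for title, word in pairs]
    labels = [1] * len(positive_samples) + [0] * len(negative_samples)
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),  # character n-gram features
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model

def passes_validity_check(model, title, candidate):
    """Return True when the model judges the candidate to be a legal entity word."""
    return int(model.predict([f"{title} ||| {candidate}"])[0]) == 1
```

The embodiment's actual model, features, and training procedure may of course differ; the sketch only mirrors the positive/negative sample construction described above.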
Optionally, the processing module 202 is specifically configured to take, among the plurality of candidate abbreviations, the candidate abbreviations whose word structure satisfies the word construction rule of the full name as abbreviations of the full name, wherein the word construction rule of the full name is determined according to the word structure of the full name.
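As one illustrative word construction rule (not the rule defined by the embodiment), an abbreviation of a Chinese organisation name is often required to be an ordered subsequence of the full name, which can be checked as follows:

```python
def follows_word_construction_rule(candidate, full_name):
    """Illustrative rule only: the candidate must be an ordered subsequence of the full name."""
    remaining = iter(full_name)
    # `character in remaining` consumes the iterator up to the match,
    # so the characters must appear in the same order as in the candidate.
    return all(character in remaining for character in candidate)
```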
An embodiment of the present application provides a computer device comprising a program or instructions which, when executed, perform the abbreviation extraction method and any of the optional methods provided in the embodiments of the present application.
An embodiment of the present application provides a storage medium comprising a program or instructions which, when executed, perform the abbreviation extraction method and any of the optional methods provided in the embodiments of the present application.
Finally, it should be noted that: as will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (9)
1. An abbreviation extraction method, comprising:
acquiring a plurality of texts containing a full name, wherein each text comprises a title and/or a text body;
determining relevance indexes of the full name and the plurality of texts according to the distribution of the full name in the plurality of texts, in the following manner:
for any text in the plurality of texts, taking at least one of the following as a distribution index of the full name in the text body of the text:
the total number of times the full name appears in the text body of the text, as a first index; the position of each paragraph of the text body in which the full name appears and the number of times the full name appears in that paragraph, as a second index; the sentence position of each sentence of the text body in which the full name appears and the sentence structure of that sentence, as a third index; the number of sentences of the text body between the first occurrence position and the last occurrence position of the full name, as a fourth index; and the number of sentences containing the full name between the first occurrence position and the last occurrence position of the full name in the text body, as a fifth index;
for any text in the plurality of texts, determining a weight value of each distribution index according to that distribution index in the text body of the text, and determining the relevance index of the full name and the text according to the weight values of the distribution indexes;
determining a plurality of candidate texts from the plurality of texts according to the relevance indexes of the full name and the plurality of texts;
for any candidate text in the plurality of candidate texts, extracting a candidate abbreviation contained in the candidate text according to the text body of the candidate text and a first preset sentence structure and/or according to the longest common subsequence of the title of the candidate text and the full name; and
taking the candidate abbreviations that pass a validity check, among the candidate abbreviations contained in the plurality of candidate texts, as abbreviations of the full name.
2. The method of claim 1, wherein the weight value of the second index is determined as follows:
for each paragraph in which the full name appears, when the paragraph is the first or last paragraph, setting a first sub-weight value of the paragraph to a first preset value, and otherwise setting the first sub-weight value of the paragraph to a second preset value, the first preset value being greater than the second preset value; and setting a second sub-weight value of the paragraph according to the number of times the full name appears in the paragraph and a preset increasing function;
determining the weight value of each paragraph according to the first sub-weight value and the second sub-weight value of the paragraph; and
determining the weight value of the second index according to the weight values of the paragraphs.
3. The method of claim 1, wherein the weight value of the third index is determined as follows:
for each sentence in which the full name appears, when the sentence is the first or last sentence of its paragraph, setting a first sub-weight value of the sentence to a third preset value, and otherwise setting the first sub-weight value of the sentence to a fourth preset value, the third preset value being greater than the fourth preset value; when the sentence structure of the sentence is a second preset sentence structure, setting a second sub-weight value of the sentence to a fifth preset value, and otherwise setting the second sub-weight value of the sentence to a sixth preset value, the fifth preset value being smaller than the sixth preset value;
determining the weight value of each sentence according to the first sub-weight value and the second sub-weight value of the sentence; and
determining the weight value of the third index according to the weight values of the sentences.
4. The method of claim 1, wherein determining the relevance index of the full name and the text body of the text according to the weight values of the distribution indexes comprises:
determining the relevance index of the full name and the text body of the text according to the following formula,
wherein X1 is the weight value of the first index; X2i is the paragraph weight value of the second index for the i-th paragraph in which the full name appears, M being the number of paragraphs of the text; X3j is the sentence weight value of the third index for the j-th sentence in which the full name appears, N being the number of sentences of the text; X4 is the weight value of the fourth index; X5 is the weight value of the fifth index; and S is the relevance index of the full name and the text body of the text.
5. The method of claim 1, wherein extracting the candidate abbreviation contained in the candidate text according to the longest common subsequence of the title of the candidate text and the full name comprises:
if the longest common subsequence of the title of the candidate text and the full name is a substring of the title of the candidate text, taking the longest common subsequence as a candidate substring; and
if the frequency of the candidate substring in the titles of the plurality of candidate texts is greater than a preset frequency threshold, taking the candidate substring as a candidate abbreviation contained in the candidate text.
6. The method according to any one of claims 1 to 5, wherein taking the candidate abbreviations that pass the validity check, among the candidate abbreviations contained in the plurality of candidate texts, as abbreviations of the full name comprises:
for any candidate abbreviation among the plurality of candidate abbreviations, inputting the candidate abbreviation into an entity word validity discrimination model, and if the output of the entity word validity discrimination model indicates that the candidate abbreviation is a legal entity word, taking the candidate abbreviation as an abbreviation of the full name; wherein the entity word validity discrimination model is obtained by machine learning training on a preset sample set, each positive sample of the preset sample set being a title and an entity word contained in the title, and each negative sample of the preset sample set being a title and a non-entity word contained in the title.
7. The method according to any one of claims 1 to 5, wherein taking the candidate abbreviations that pass the validity check, among the candidate abbreviations contained in the plurality of candidate texts, as abbreviations of the full name comprises:
taking, among the plurality of candidate abbreviations, the candidate abbreviations whose word structure satisfies the word construction rule of the full name as abbreviations of the full name, wherein the word construction rule of the full name is determined according to the word structure of the full name.
8. An abbreviation extraction apparatus, comprising:
an obtaining module, configured to obtain a plurality of texts containing a full name, wherein each text comprises a title and/or a text body;
a processing module, configured to determine relevance indexes of the full name and the plurality of texts according to the distribution of the full name in the plurality of texts, in the following manner:
for any text in the plurality of texts, taking at least one of the following as a distribution index of the full name in the text body of the text:
the total number of times the full name appears in the text body of the text, as a first index; the position of each paragraph of the text body in which the full name appears and the number of times the full name appears in that paragraph, as a second index; the sentence position of each sentence of the text body in which the full name appears and the sentence structure of that sentence, as a third index; the number of sentences of the text body between the first occurrence position and the last occurrence position of the full name, as a fourth index; and the number of sentences containing the full name between the first occurrence position and the last occurrence position of the full name in the text body, as a fifth index;
for any text in the plurality of texts, determining a weight value of each distribution index according to that distribution index in the text body of the text, and determining the relevance index of the full name and the text according to the weight values of the distribution indexes;
the processing module being further configured to determine a plurality of candidate texts from the plurality of texts according to the relevance indexes of the full name and the plurality of texts; for any candidate text in the plurality of candidate texts, extract a candidate abbreviation contained in the candidate text according to the text body of the candidate text and a first preset sentence structure and/or according to the longest common subsequence of the title of the candidate text and the full name; and take the candidate abbreviations that pass the validity check, among the candidate abbreviations contained in the plurality of candidate texts, as abbreviations of the full name.
9. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010545742.8A CN111695340B (en) | 2020-06-16 | 2020-06-16 | Method and device for extracting short names |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010545742.8A CN111695340B (en) | 2020-06-16 | 2020-06-16 | Method and device for extracting short names |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111695340A CN111695340A (en) | 2020-09-22 |
CN111695340B (en) | 2021-12-28
Family
ID=72481211
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010545742.8A Active CN111695340B (en) | 2020-06-16 | 2020-06-16 | Method and device for extracting short names |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111695340B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105956192A (en) * | 2016-06-15 | 2016-09-21 | 中国互联网络信息中心 | Method and system for acquiring shortened form of organization name based on website homepage information |
CN106294320A (en) * | 2016-08-04 | 2017-01-04 | 武汉数为科技有限公司 | A kind of terminology extraction method and system towards scientific paper |
CN106933800A (en) * | 2016-11-29 | 2017-07-07 | 首都师范大学 | A kind of event sentence abstracting method of financial field |
CN107423285A (en) * | 2017-06-23 | 2017-12-01 | 广州市万隆证券咨询顾问有限公司 | A kind of company's abbreviation recognition methods and system based on text rule |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2019056954A (en) * | 2017-09-19 | 2019-04-11 | 富士ゼロックス株式会社 | Information processing apparatus and information processing program |
2020-06-16: CN application CN202010545742.8A, granted as patent CN111695340B (status: Active)
Non-Patent Citations (2)
Title |
---|
Research on Acquiring Entity Abbreviations from Chinese Web Pages; Ding Yuanjun et al.; Computer Science; 2012-03-31; Vol. 39, No. 3; Sections 2-3 *
Research on Key Technologies of Multi-Document Automatic Summarization; Xu Yongdong; China Doctoral Dissertations Full-text Database; 2008-12-16; Chapter 4 *
Also Published As
Publication number | Publication date |
---|---|
CN111695340A (en) | 2020-09-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||