CN116976320B

CN116976320B - Mechanism short extraction method, device, computer equipment and storage medium

Info

Publication number: CN116976320B
Application number: CN202311226820.8A
Authority: CN
Inventors: 姜桂林; 贵照众; 刘刚健; 齐雪
Original assignee: Hunan Caixin Digital Technology Co ltd
Current assignee: Hunan Caixin Digital Technology Co ltd
Priority date: 2023-09-22
Filing date: 2023-09-22
Publication date: 2023-12-15
Anticipated expiration: 2043-09-22
Also published as: CN116976320A

Abstract

The embodiment of the application belongs to the technical field of natural language processing, and relates to a mechanism short-term extraction method, a device, computer equipment and a storage medium, wherein the method comprises the following steps: word segmentation is carried out on the mechanism full scale to obtain a plurality of morphemes, and a morpheme sequence is generated; generating a continuous morpheme subsequence of all morpheme sequences; determining a probability calculation mode of each continuous morpheme subsequence, wherein the probability calculation mode comprises word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence; acquiring word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence from a pre-established morpheme library so as to calculate sequence probability of the continuous morpheme subsequence; calculating the sequence score of the continuous morpheme subsequence according to the sequence probability and the sequence length of the continuous morpheme subsequence; and screening target subsequences from the continuous morpheme subsequences according to the obtained sequence scores, and taking the screened target subsequences as a mechanism of a target mechanism for short. The application improves the accuracy of the mechanism short extraction.

Description

Mechanism short extraction method, device, computer equipment and storage medium

Technical Field

The present application relates to the field of natural language processing technologies, and in particular, to a method and apparatus for extracting a mechanism, a computer device, and a storage medium.

Background

Various institutions such as enterprises, institutions and the like have formal institution scales, the institution scales are usually long, and for convenience in description, institution scales are often adopted for replacement. Therefore, it is particularly important how to generate accurate and useful abbreviations.

The existing mechanism is called generating technology for short, which is generally called resolving mechanism, and the resolving mechanism is divided into a plurality of morphemes, wherein the morphemes comprise region information, root words, industry information, mechanism type information and the like, and then the morphemes are simplified or screened according to a preset strategy and combined into mechanism short, and the root words often determine the mechanism short to a great extent. However, when the root of a word belongs to a common word, the organization abbreviation is also often a common word, for example, the existing abbreviation generation technology extracts the "alpha finite company" for short to obtain "alpha", and the common word "alpha" appears in many places or public opinion, so that the organization abbreviation lacks distinguishing property; when public opinion information is acquired according to the organization abbreviation, a large amount of irrelevant information is acquired. Therefore, the accuracy of the existing mechanism is low in short generation technology.

Disclosure of Invention

The embodiment of the application aims to provide a mechanism short extraction method, a device, computer equipment and a storage medium, so as to solve the problem of low mechanism short extraction accuracy.

In order to solve the above technical problems, the embodiment of the present application provides a mechanism abbreviated as extraction method, which adopts the following technical scheme:

obtaining a mechanism full scale of a target mechanism;

word segmentation is carried out on the mechanism full scale to obtain a plurality of morphemes, and a morpheme sequence is obtained according to each morpheme;

generating a continuous morpheme subsequence of all the morpheme sequences, wherein the continuous morpheme subsequence comprises at least two continuous morphemes;

for each continuous morpheme subsequence, determining a probability calculation mode of the continuous morpheme subsequence according to a preset probability algorithm, wherein the probability calculation mode comprises word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence;

based on the probability calculation mode, acquiring word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence from a pre-established morpheme library so as to calculate sequence probability of the continuous morpheme subsequence;

calculating the sequence score of the continuous morpheme subsequence according to the sequence probability and the sequence length of the continuous morpheme subsequence;

And screening target subsequences from the continuous morpheme subsequences according to the obtained sequence scores, and taking the screened target subsequences as a mechanism of the target mechanism for short.

Further, the step of performing word segmentation processing on the mechanism full scale to obtain a plurality of morphemes and obtaining a morpheme sequence according to each morpheme comprises the following steps:

extracting the branch information of the mechanism in the mechanism full scale through regular sentences;

removing the mechanism branch information from the mechanism full scale to obtain a first mechanism name;

extracting regional information in the first organization name to split the first organization name into regional information and a second organization name;

performing word segmentation processing on the second mechanism name to obtain a plurality of morphemes, and generating an initial morpheme sequence according to each morpheme;

and adding the regional information as morphemes to the head of the initial morpheme sequence to obtain a morpheme sequence.

Further, the step of performing word segmentation on the second organization name to obtain a plurality of morphemes, and generating an initial morpheme sequence according to each morpheme obtained includes:

performing word segmentation on the second mechanism name to obtain a plurality of morphemes, wherein each morpheme has a position sequence, and the position sequence of each morpheme is determined by the position of each morpheme in the second mechanism name;

And generating an initial morpheme sequence according to the morphemes with the position sequence.

Further, before the step of obtaining the mechanism generic name of the target mechanism, the method further comprises:

obtaining common morphemes;

obtaining each morpheme pair of each common morpheme, wherein the common morpheme exists in each morpheme pair, and each morpheme pair comprises two morphemes;

according to a preset text library, calculating word frequency probability of the common morphemes and conditional probability of each morpheme pair of the common morphemes;

and generating a morpheme library according to the word frequency probability and the conditional probability corresponding to each common morpheme.

Further, when the continuous morpheme subsequence includes three morphemes, the probability calculation manner is expressed as:；

wherein A, B, C are each continuous morphemes,a continuous morpheme subsequence of morphemes A, B, C,is the continuous morpheme subsequence->Sequence probability of>Word frequency probability for morpheme A, +.>Word frequency probability for morpheme B, +.>Word frequency probability for morpheme C, +.>Conditional probability for morpheme A followed by morpheme B,/->Conditional probability of following morpheme C for morpheme B, +.>Is a preset minimum probability value.

Further, the calculation formula of the sequence score is expressed as follows: ；

Where score is the sequence score of the continuous morpheme subsequence, P is the sequence probability of the continuous morpheme subsequence, k is the correction factor, len is the sequence length of the continuous morpheme subsequence, and e is the natural logarithm.

Further, the step of selecting the target subsequence from the continuous morpheme subsequences according to the obtained sequence score includes:

selecting a target subsequence with the maximum sequence score from all the continuous morpheme subsequences according to the obtained sequence scores; or,

and selecting a target subsequence with the sequence score exceeding the preset score and the shortest sequence length from the continuous morpheme subsequences according to the obtained sequence scores.

In order to solve the technical problems, the embodiment of the application also provides a mechanism abbreviated as an extraction device, which adopts the following technical scheme:

the full scale acquisition module is used for acquiring the mechanism full scale of the target mechanism;

the full-scale word segmentation module is used for carrying out word segmentation processing on the mechanism full scale to obtain a plurality of morphemes, and obtaining morpheme sequences according to each morpheme;

the subsequence generation module is used for generating continuous morpheme subsequences of all morpheme sequences, wherein the continuous morpheme subsequences comprise at least two continuous morphemes;

The calculation determining module is used for determining a probability calculation mode of each continuous morpheme subsequence according to a preset probability algorithm, wherein the probability calculation mode comprises word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence;

the probability calculation module is used for acquiring word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence from a pre-established morpheme library based on the probability calculation mode so as to calculate the sequence probability of the continuous morpheme subsequence;

the score calculating module is used for calculating the sequence score of the continuous morpheme subsequence according to the sequence probability and the sequence length of the continuous morpheme subsequence;

and the short determining module is used for screening target subsequences from the continuous morpheme subsequences according to the obtained sequence scores, and taking the screened target subsequences as a mechanism short of the target mechanism.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, where the computer device includes a memory and a processor, where the memory stores computer readable instructions, and the processor implements the steps of the mechanism abbreviated extraction method described above when executing the computer readable instructions.

In order to solve the above technical problem, the embodiments of the present application further provide a computer readable storage medium, where computer readable instructions are stored on the computer readable storage medium, and when the computer readable instructions are executed by a processor, the steps of the mechanism abbreviated as extraction method described above are implemented.

Compared with the prior art, the embodiment of the application has the following main beneficial effects: obtaining the mechanism full name of a target mechanism, performing word segmentation processing to obtain a plurality of morphemes, and generating morpheme sequences according to each morpheme; generating continuous morpheme subsequences of all morpheme sequences to obtain simple expression of all possible morphemes of the mechanism; for each continuous morpheme subsequence, determining a probability calculation mode of the continuous morpheme subsequence according to a preset probability algorithm, wherein the probability calculation mode comprises word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence; acquiring word frequency probability and related conditional probability of each morpheme from a pre-established morpheme library according to a probability calculation mode, and calculating sequence probability of occurrence probability of continuous morpheme subsequences; calculating the sequence score of the continuous morpheme subsequence according to the sequence probability and the sequence length of the continuous morpheme subsequence, wherein the sequence score reflects the semantic value of the continuous morpheme subsequence; and screening target subsequences from the continuous morpheme subsequences according to the sequence scores, and taking the screened target subsequences as mechanism abbreviations of target mechanisms, thereby completing the simplification of the mechanism names. The application calculates based on word frequency probability, conditional probability and sequence length, can extract key morphemes to form mechanism abbreviation, and ensures the accuracy of the mechanism abbreviation.

Drawings

In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow chart of one embodiment of a facility abbreviation extraction method according to the present application;

fig. 3 is a schematic view of the structure of an embodiment of the mechanism abbreviation extraction device according to the present application;

FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.

Description of the embodiments

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 players (Moving Picture Experts Group Audio Layer III, dynamic video expert compression standard audio plane 3), MP4 (Moving Picture Experts Group Audio Layer IV, dynamic video expert compression standard audio plane 4) players, laptop and desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the method for extracting the mechanism abbreviation provided by the embodiment of the present application is generally executed by a server, and correspondingly, the device for extracting the mechanism abbreviation is generally disposed in the server.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow chart of one embodiment of an organization short extraction method according to the present application is shown. The mechanism short-term extraction method comprises the following steps:

In step S201, the mechanism name of the target mechanism is acquired.

In this embodiment, the electronic device (for example, the server shown in fig. 1) on which the mechanism abbreviated extraction method operates may communicate with the terminal device through a wired connection manner or a wireless connection manner. It should be noted that the wireless connection may include, but is not limited to, 3G/4G/5G connection, wiFi connection, bluetooth connection, wiMAX connection, zigbee connection, UWB (ultra wideband) connection, and other now known or later developed wireless connection.

Specifically, the mechanism of the acquisition target mechanism is collectively called. The target institution may be various types of institutions such as enterprises, institutions, academia, and the like. The institution name may be a formal and standard name of the target institution, for example, a registered name of an insurance company is "XX province YY insurance company ZZ division", and the registered name may be used as the institution name.

Step S202, word segmentation processing is carried out on the mechanism full name to obtain a plurality of morphemes, and a morpheme sequence is obtained according to each morpheme.

Specifically, word segmentation is performed on the mechanism full scale to obtain a plurality of morphemes, and a morpheme sequence can be constructed according to each morpheme.

Further, the step S202 may include: extracting mechanism branch information in the mechanism full scale through regular sentences; removing the branch information of the mechanism from the mechanism full scale to obtain a first mechanism name; extracting regional information in the first organization name to split the first organization name into regional information and a second organization name; word segmentation processing is carried out on the second mechanism name to obtain a plurality of morphemes, and an initial morpheme sequence is generated according to each morpheme; and adding the regional information as morphemes to the head of the initial morpheme sequence to obtain the morpheme sequence.

Specifically, the mechanism branch information in the mechanism's full name is extracted by a regular sentence, and the mechanism branch information is used to indicate that the target mechanism is a branch mechanism, for example, "CS branch office", CS is a place name.

The branch information is removed from the organization's title and the remainder is the first organization name. Region information in the first organization name is extracted to split the first organization name into region information and a second organization name. For example, extracting regional information in the first organization name through a cpca library of Python; the cpca library, which is collectively referred to as the chinese_program_city_area_mapper, is a python module that identifies provinces, cities and regions in simplified chinese strings and enables mapping, verification and simple drawing.

Word segmentation processing is carried out on the second mechanism name to obtain a plurality of morphemes, and an initial morpheme sequence is generated according to each morpheme; the region information is added as a morpheme to the head of the initial morpheme sequence (i.e., the region information is added to the initial morpheme sequence and serves as the first morpheme in the initial morpheme sequence), resulting in a morpheme sequence.

For example, the existing institutions are collectively called as "Hunan lovely baby mother and infant products Limited company", and the institution branch information "Changsha Limited company" in the institutions are extracted through regular sentences and removed to obtain the first institution name "Hunan lovely baby mother and infant products Limited company". The regional information 'Hunan' in the first institution name is extracted, and the second institution name 'lovely baby mother and infant supplies limited company' is left. The second organization name of lovely baby mother and infant supplies limited company is subjected to word segmentation processing to obtain morphemes: "lovely", "baby", "mother and infant products", "limited", to obtain the initial morpheme sequence { "lovely", "baby", "mother and infant products", "limited" }. Then the regional information ' Hunan ' is used as morpheme to be inserted into the initial morpheme sequence, and the final morpheme sequence { ' Hunan ', ' lovely ', ' baby ', ' mother and infant articles ', ' limited company ', ' is obtained.

In the embodiment, the mechanism branch information in the mechanism holonomy is extracted and removed through the regular statement to obtain a first mechanism name so as to complete preprocessing; splitting the first organization name into region information and a second organization name; performing word segmentation processing on the second mechanism name to obtain a plurality of morphemes, and generating an initial morpheme sequence according to each morpheme; and adding the regional information as morphemes to the head of the initial morpheme sequence to obtain the morpheme sequence so as to extract the organization abbreviations according to the morpheme sequence.

Further, the step of performing word segmentation on the second organization name to obtain a plurality of morphemes, and generating an initial morpheme sequence according to each morpheme obtained may include: word segmentation processing is carried out on the second mechanism name to obtain a plurality of morphemes, wherein each morpheme has a position sequence, and the position sequence of each morpheme is determined by the position of each morpheme in the second mechanism name; an initial morpheme sequence is generated from each morpheme with a position order.

Specifically, the second organization name is subjected to word segmentation processing to obtain a plurality of morphemes. Morphemes are text fragments split from the second organization name that have location information in the second organization name, which also brings the morphemes into a location order. In the second organization name "lovely baby mother and infant supplies limited", taking the above example, the "lovely" appears before the "baby", and thus the morpheme "lovely" is located in sequence before the morpheme "baby".

An initial morpheme sequence is generated from each morpheme with a position order. It will be appreciated that each morpheme in the morpheme sequence also corresponds to a positional sequence. When the region information is added to the initial morpheme sequence, the position sequence of the region information is 1, that is, all morphemes in the initial morpheme sequence need to be extended one bit backward before all morphemes in the initial morpheme sequence. In the subsequent processing, the sequence of positions of the morphemes is involved, e.g. the sequence of positions determines which morphemes are connected, so that a sequence of consecutive morphemes can be generated.

In this embodiment, the second mechanism name is subjected to word segmentation processing to obtain a plurality of morphemes, and the position sequence of each morpheme can be determined according to the position of each morpheme in the second mechanism name; according to each morpheme with a position sequence, an initial morpheme sequence is generated, and the correct generation of a subsequent continuous morpheme subsequence is ensured.

In step S203, a continuous morpheme sub-sequence of all morphemes is generated, the continuous morpheme sub-sequence comprising at least two continuous morphemes.

Specifically, a continuous morpheme subsequence of all morpheme sequences is generated. The continuous morpheme subsequence comprises at least two continuous morphemes, and the continuity between morphemes is determined by the position sequence of the morphemes.

The longest contiguous morpheme subsequence may contain all morphemes. In connection with the foregoing example, the generated continuation morpheme subsequence may include: { "Hunan", "lovely" }, { "lovely", "" }, { "Hunan", "lovely" }, { "lovely", "baby", "mother and infant articles" }, etc., are not all listed herein.

Step S204, for each continuous morpheme sub-sequence, determining a probability calculation mode of the continuous morpheme sub-sequence according to a preset probability algorithm, wherein the probability calculation mode comprises word frequency probability and conditional probability of each morpheme in the continuous morpheme sub-sequence.

Specifically, the application provides a probability algorithm for calculating the sequence probability (namely the occurrence probability) of the continuous morpheme subsequence. The specific probability calculation method is different for consecutive morpheme subsequences of different morpheme lengths (i.e., the number of morphemes included in the consecutive morpheme subsequence). However, the probability calculation formula for the n+1 morpheme length continuous morpheme subsequences is an iteration over the probability calculation formula for the N (N is a positive integer) morpheme length continuous morpheme subsequences.

When a sequence of consecutive morphemes contains two morphemes, the probability calculation is expressed as: （1）

Wherein A, B are each continuous morphemes,a continuous morpheme subsequence consisting of morpheme A, B->Is the continuous morpheme subsequence->Sequence probability of>Word frequency probability for morpheme A, +.>The word frequency probability for morpheme B,conditional probability for morpheme A followed by morpheme B,/->Is a preset minimum probability value.

Further, when the continuous morpheme subsequence contains three morphemes, the probability calculation manner is expressed as:（2）

As can be seen from comparing the formulas 1 and 2, when the morpheme length of the continuous morpheme subsequence is changed from 2 to 3, a new continuous multiplication factor is added to the calculation formulaThe running factor relates to the second morpheme and the third morpheme in the sequence of consecutive morpheme sub-sequences. By mathematical induction, the probability calculation mode of continuous morpheme subsequences with longer morpheme length can be deduced.

When a sequence of consecutive morphemes contains i morphemes, the probability calculation is expressed as:（4）

wherein,、/>、/>、…、/>、/>respectively continuous morphemes, ++>Is morpheme、/>、/>、…、/>、/>Constituent continuous morpheme subsequences, +.>Is the continuous morpheme subsequence->Sequence probability of>Is a morpheme->Word frequency probability of->Is a morpheme->Word frequency probability of->Is a morpheme->Word frequency probability of->Is a morpheme->Word frequency probability of->Is a morpheme->Postamble->Conditional probability of->Is a morpheme->Postamble->Conditional probability of->Is morphemePostamble->Conditional probability of->Is a preset minimum probability value.

In this embodiment, according to the word frequency probability of each morpheme in the continuous morpheme subsequence and the related conditional probability, the sequence probability of the continuous morpheme subsequence may be calculated, and the occurrence probability of the continuous morpheme subsequence may be accurately estimated and calculated.

In general, the higher the word frequency probability, the more common the representative morphemes, the lower the entropy and the lower the information content. In the application, the morphemes with too high word frequency probability are not selected, and the morphemes which are too common appear in a plurality of places or public opinion, so that the institution is short for distinguishing; when public opinion information is acquired according to the organization abbreviation, a large amount of irrelevant information is acquired. The morphemes with low word frequency probability are not easy to hit a large amount of public opinion information, the entropy is high, and the information content is also large.

The conditional probability can avoid that some groups with fixed collocations but little use are screened out. For example, both A and B are rare, but AB is a fixed phrase, and the limit conditions may be P (AB) ≡P (A) ≡P (B). Such as "gulosity", basically P (gulosity) =p (be greedy for food) =p (epicenter), P (be greedy for food |epicenter) =1. If the conditional probability is not used, but rather the word frequency probability is directly used, the probability calculated for P (gulite) is much less than the actual probability.

Also, for example, "Blooming flowers" are common words, but sometimes the word segmentation process is divided into "Blooming flowers" and "Blooming flowers". The occurrence probability of "flower good" is very low, so that P (flower good) ×p (month circle) is very low, and the "flower good" may reach the threshold. However, after the conditional probability is calculated, since P (the moon | flowers well) is very high, the overall probability drops less rapidly, so the threshold is not reached.

Step S205, based on a probability calculation mode, the word frequency probability and the conditional probability of each morpheme in the continuous morpheme subsequence are obtained from a pre-established morpheme library so as to calculate the sequence probability of the continuous morpheme subsequence.

Specifically, based on the probability calculation mode, the word frequency probability of the morpheme to be acquired in calculation and the conditional probabilities of the morpheme can be known. The application establishes a morpheme library in advance, and the word frequency probability of a large number of morphemes and the conditional probability among different morphemes are recorded in the morpheme library. Accessing a morpheme library, acquiring the required word frequency probability and the probability value of the conditional probability according to a probability calculation mode, and calculating the sequence probability of the continuous morpheme subsequence according to the probability calculation mode.

Further, before the step S201, the method may further include: obtaining common morphemes; obtaining each morpheme pair of each common morpheme, wherein the common morphemes exist in each morpheme pair, and each morpheme pair comprises two morphemes; according to a preset text library, calculating word frequency probability of common morphemes and conditional probability of each morpheme pair of the common morphemes; and generating a morpheme library according to the word frequency probability and the conditional probability corresponding to each common morpheme.

Specifically, each common morpheme, for example, morpheme A, B, C in formula (2), is obtained. Each morpheme pair of each common morpheme is obtained, the common morpheme exists in each morpheme pair, and each morpheme pair comprises two morphemes. The morphemes in the morpheme pair also have an order which determines the order of the two morphemes which are connected in front of and behind each other, for example, the morpheme pair AB and the morpheme pair BA are different, and the morpheme pair AB represents the morpheme A and then the morpheme B, and the corresponding conditional probability is thatThe method comprises the steps of carrying out a first treatment on the surface of the The morpheme pair BA represents morpheme B followed by morpheme A, and the corresponding conditional probability is +.>。

And acquiring a preset text library, wherein the text library contains a large amount of texts. Based on a preset text library, word frequency probability of common morphemes and conditional probability of each morpheme pair of the common morphemes can be calculated. According to the word frequency probability and the conditional probability corresponding to each common morpheme, a morpheme library can be constructed.

In the embodiment, each common morpheme is obtained, each morpheme pair of each common morpheme is obtained, and the morpheme pair represents two morphemes which appear in front-back connection; based on a preset text library, calculating word frequency probability of common morphemes and conditional probability of each morpheme pair of the common morphemes; and generating a morpheme library according to the word frequency probability and the conditional probability corresponding to each common morpheme, and preparing data for probability calculation of subsequent continuous morpheme subsequences.

Step S206, calculating the sequence score of the continuous morpheme subsequence according to the sequence probability and the sequence length of the continuous morpheme subsequence.

Specifically, the continuation morpheme subsequence has a sequence length, which may be the total number of characters in the continuation morpheme subsequence, and in one embodiment, the number of morphemes in the continuation morpheme subsequence may also be referred to as the sequence length.

And calculating the sequence score of the continuous morpheme subsequence according to the sequence probability and the sequence length of the continuous morpheme subsequence and a preset sequence score calculation formula. The sequence score represents the semantic value of the sequence of consecutive morphemes.

Further, the calculation formula of the sequence score is expressed as:（3）

In this embodiment, the sequence score is calculated according to the sequence probability and the sequence length of the continuous morpheme subsequence, and the semantic value of the continuous morpheme subsequence is accurately evaluated.

Step S207, selecting target subsequences from the continuous morpheme subsequences according to the obtained sequence scores, and taking the selected target subsequences as the mechanism of the target mechanism for short.

Specifically, the sequence scores of the continuous morpheme subsequences are compared, a target subsequence is selected from the continuous morpheme subsequences according to the comparison result, and the selected target subsequence is used as a mechanism of a target mechanism for short.

The application can extract the mechanism abbreviation which accurately and simply expresses the mechanism full name. When public opinion information is acquired according to organization abbreviations, the number and probability of occurrence of irrelevant public opinion can be reduced.

Further, the step of selecting the target subsequence from the continuous morpheme subsequences according to the obtained sequence score may include: selecting a target subsequence with the maximum sequence score from all the continuous morpheme subsequences according to the obtained sequence scores; or selecting a target subsequence with the sequence score exceeding the preset score and the shortest sequence length from the continuous morpheme subsequences according to the obtained sequence scores.

Specifically, comparing the obtained sequence scores, selecting the maximum sequence score, and taking the continuous morpheme subsequence corresponding to the maximum sequence score as a target subsequence, namely selecting the continuous morpheme subsequence with the highest semantic value as the target subsequence. For example, the continuous morpheme subsequence { "Hunan", "lovely", "baby", "mother and infant" } has the highest sequence score, and the continuous morpheme subsequence is selected to obtain the baby mother and infant which is lovely in Hunan for organization.

Or, acquiring a preset score, and selecting a continuous morpheme subsequence with the shortest sequence length from continuous morpheme subsequences with the sequence scores exceeding the preset score as a target subsequence, so that the target subsequence is ensured to have the simplest expression under the condition of higher semantic value.

In this embodiment, a target subsequence having a maximum sequence score is selected from among the continuous morpheme subsequences; or selecting a target subsequence with the sequence score exceeding the preset score and the shortest sequence length from the continuous morpheme subsequences, thereby enriching the selection modes of the target sequence.

In the embodiment, obtaining a mechanism full name of a target mechanism, performing word segmentation processing to obtain a plurality of morphemes, and generating morpheme sequences according to each morpheme; generating continuous morpheme subsequences of all morpheme sequences to obtain simple expression of all possible morphemes of the mechanism; for each continuous morpheme subsequence, determining a probability calculation mode of the continuous morpheme subsequence according to a preset probability algorithm, wherein the probability calculation mode comprises word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence; acquiring word frequency probability and related conditional probability of each morpheme from a pre-established morpheme library according to a probability calculation mode, and calculating sequence probability of occurrence probability of continuous morpheme subsequences; calculating the sequence score of the continuous morpheme subsequence according to the sequence probability and the sequence length of the continuous morpheme subsequence, wherein the sequence score reflects the semantic value of the continuous morpheme subsequence; and screening target subsequences from the continuous morpheme subsequences according to the sequence scores, and taking the screened target subsequences as mechanism abbreviations of target mechanisms, thereby completing the simplification of the mechanism names. The application calculates based on word frequency probability, conditional probability and sequence length, can extract key morphemes to form mechanism abbreviation, and ensures the accuracy of the mechanism abbreviation.

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by computer readable instructions stored in a computer readable storage medium that, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.

With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a mechanism abbreviated as an extraction device, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device is particularly applicable to various electronic devices.

As shown in fig. 3, the mechanism abbreviation extraction device 300 according to the present embodiment includes: a full scale acquisition module 301, a full scale word segmentation module 302, a subsequence generation module 303, a calculation determination module 304, a probability calculation module 305, a score calculation module 306, and a determination module 307 for short, wherein:

the full scale acquisition module 301 is configured to acquire a mechanism full scale of a target mechanism.

The full name word segmentation module 302 is configured to perform word segmentation on the mechanism full name to obtain a plurality of morphemes, and obtain a morpheme sequence according to each morpheme.

The subsequence generation module 303 is configured to generate a continuous morpheme subsequence of all morpheme sequences, where the continuous morpheme subsequence includes at least two continuous morphemes.

The calculation determining module 304 is configured to determine, for each continuous morpheme sub-sequence, a probability calculation manner of the continuous morpheme sub-sequence according to a preset probability algorithm, where the probability calculation manner includes a word frequency probability and a conditional probability of each morpheme in the continuous morpheme sub-sequence.

The probability calculation module 305 is configured to obtain, based on a probability calculation manner, word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence from a pre-established morpheme library, so as to calculate a sequence probability of the continuous morpheme subsequence.

The score calculating module 306 is configured to calculate a sequence score of the continuous morpheme subsequence according to the sequence probability and the sequence length of the continuous morpheme subsequence.

The abbreviation determining module 307 is configured to screen a target subsequence from the continuous morpheme subsequences according to the obtained sequence score, and take the screened target subsequence as a mechanism abbreviation of the target mechanism.

In some alternative implementations of the present embodiment, the full term segmentation module 302 may include: the system comprises a branch extraction sub-module, a branch removal sub-module, a region extraction sub-module, a name word segmentation sub-module and a region addition sub-module, wherein:

and the branch extraction sub-module is used for extracting the mechanism branch information in the mechanism holonomy through the regular statement.

And the branch removing sub-module is used for removing the branch information of the mechanism from the mechanism full scale to obtain a first mechanism name.

And the region extraction sub-module is used for extracting region information in the first organization name so as to split the first organization name into region information and a second organization name.

The name word segmentation sub-module is used for carrying out word segmentation processing on the second mechanism name to obtain a plurality of morphemes, and generating an initial morpheme sequence according to each morpheme.

And the region adding sub-module is used for adding the region information serving as morphemes to the head of the initial morpheme sequence to obtain the morpheme sequence.

In some optional implementations of this embodiment, the name-segmentation sub-module may include: the device comprises a name word segmentation unit and a sequence generation unit, wherein:

the name word segmentation unit is used for carrying out word segmentation processing on the second mechanism name to obtain a plurality of morphemes, wherein each morpheme is provided with a position sequence, and the position sequence of each morpheme is determined by the position of each morpheme in the second mechanism name.

And the sequence generating unit is used for generating an initial morpheme sequence according to each morpheme with the position sequence.

In some optional implementations of the present embodiment, the mechanism abbreviated extraction device 300 may further include: the system comprises a morpheme acquisition module, a morpheme pair acquisition module, a common calculation module and a morpheme library generation module, wherein:

and the morpheme acquisition module is used for acquiring each common morpheme.

The morpheme pair acquisition module is used for acquiring each morpheme pair of each common morpheme, the common morpheme exists in each morpheme pair, and each morpheme pair comprises two morphemes.

The common calculation module is used for calculating word frequency probability of common morphemes and conditional probability of each morpheme pair of the common morphemes according to a preset text library.

And the morpheme library generating module is used for generating a morpheme library according to the word frequency probability and the conditional probability corresponding to each common morpheme.

In some alternative implementations of the present embodiment, when the contiguous morpheme subsequence contains three morphemes, the probability calculation is expressed as:；

wherein A, B, C are each continuous morphemes,a continuous morpheme subsequence of morphemes A, B, C,is the continuous morpheme subsequence->Sequence probability of>Word frequency probability for morpheme A, +.>Word frequency probability for morpheme B, +.>Word frequency probability for morpheme C, +.>Conditional probability for morpheme A followed by morpheme B,/- >Conditional probability of following morpheme C for morpheme B, +.>Is a preset minimum probability value.

In the present embodimentIn some alternative implementations, the calculation formula for the sequence score is expressed as:；

In some optional implementations of the present embodiment, the determining module 307 may include: maximum selection sub-module and shortest selection sub-module, wherein:

and the maximum selection sub-module is used for selecting a target sub-sequence with the maximum sequence score from the continuous morpheme sub-sequences according to the obtained sequence score. Or,

And the shortest selecting sub-module is used for selecting a target sub-sequence with the sequence score exceeding a preset score and the shortest sequence length from the continuous morpheme sub-sequences according to the obtained sequence score.

In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.

The computer device 4 comprises a memory 41, a processor 42, a network interface 43 communicatively connected to each other via a system bus. It is noted that only a computer device 4 having a memory 41, a processor 42, a network interface 43 is shown in the figures, but it is understood that not all illustrated components are required to be implemented and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculations and/or information processing in accordance with predetermined or stored instructions, the hardware of which includes, but is not limited to, microprocessors, application specific integrated circuits (Application Specific Integrated Circuit, ASICs), programmable gate arrays (fields-Programmable Gate Array, FPGAs), digital processors (Digital Signal Processor, DSPs), embedded devices, etc.

The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.

The memory 41 includes at least one type of readable storage medium including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random Access Memory (RAM), static Random Access Memory (SRAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), programmable Read Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the storage 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the computer device 4. Of course, the memory 41 may also comprise both an internal memory unit of the computer device 4 and an external memory device. In this embodiment, the memory 41 is generally used to store an operating system and various application software installed on the computer device 4, such as computer readable instructions of a mechanism abbreviated as extraction method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.

The processor 42 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, for example, execute computer readable instructions of the mechanism abbreviated as extraction method.

The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.

The computer device provided in the present embodiment may execute the above mechanism abbreviated extraction method. The mechanism abbreviation extraction method here may be the mechanism abbreviation extraction method of each of the above embodiments.

The present application also provides another embodiment, namely, a computer readable storage medium, where computer readable instructions are stored, where the computer readable instructions can be executed by at least one processor, so that the at least one processor performs the steps of the mechanism abbreviated as extraction method.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.

It is apparent that the above-described embodiments are only some embodiments of the present application, but not all embodiments, and the preferred embodiments of the present application are shown in the drawings, which do not limit the scope of the patent claims. This application may be embodied in many different forms, but rather, embodiments are provided in order to provide a thorough and complete understanding of the present disclosure. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described in the foregoing description, or equivalents may be substituted for elements thereof. All equivalent structures made by the content of the specification and the drawings of the application are directly or indirectly applied to other related technical fields, and are also within the scope of the application.

Claims

1. The mechanism short-term extraction method is characterized by comprising the following steps:

obtaining a mechanism full scale of a target mechanism;

for each continuous morpheme subsequence, determining a probability calculation mode of the continuous morpheme subsequence, wherein the probability calculation mode comprises word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence;

screening target subsequences from the continuous morpheme subsequences according to the obtained sequence scores, and taking the screened target subsequences as mechanism abbreviations of the target mechanism;

the probability calculation mode is expressed as follows: ；

Wherein,、/>、/>、…、/>、/>respectively continuous morphemes, ++>Is a morpheme->、、/>、…、/>、/>Constituent continuous morpheme subsequences, +.>Is a continuous morpheme subsequenceSequence probability of>Is a morpheme->Word frequency probability of->Is a morpheme->Word frequency probability of->Is a morpheme->Word frequency probability of->Is a morpheme->Word frequency probability of->Is a morpheme->Postamble->Conditional probability of->Is a morpheme->Postamble->Conditional probability of->Is morphemePostamble->Conditional probability of->Is a preset minimum probability value;

the calculation formula of the sequence score is expressed as follows:；

2. The method for extracting organization abbreviations according to claim 1, wherein the step of performing word segmentation processing on the organization scales to obtain a plurality of morphemes, and obtaining morpheme sequences according to each morpheme comprises:

3. The method for extracting organization abbreviations according to claim 2, wherein the step of performing word segmentation processing on the second organization name to obtain a plurality of morphemes, and generating an initial morpheme sequence according to each morpheme obtained includes:

4. The facility abbreviation extraction method of claim 1, further comprising, prior to the step of obtaining the facility names of the target facilities:

obtaining common morphemes;

5. The method for extracting organization abbreviation according to claim 1, wherein the step of selecting a target subsequence from each consecutive morpheme subsequence according to the obtained sequence score comprises:

6. A mechanism abbreviated extraction device, comprising:

the calculation determining module is used for determining a probability calculation mode of each continuous morpheme subsequence for each continuous morpheme subsequence, wherein the probability calculation mode comprises word frequency probability and conditional probability of each morpheme in the continuous morpheme subsequence;

the short determining module is used for screening target subsequences from the continuous morpheme subsequences according to the obtained sequence scores, and taking the screened target subsequences as a mechanism short of the target mechanism;

the probability calculation mode is expressed as follows:；

the calculation formula of the sequence score is expressed as follows: ；

7. A computer device comprising a memory and a processor, wherein the memory has stored therein computer readable instructions which when executed by the processor implement the steps of the mechanism short extraction method of any one of claims 1 to 5.

8. A computer readable storage medium, wherein computer readable instructions are stored on the computer readable storage medium, which when executed by a processor, implement the steps of the mechanism short extraction method of any one of claims 1 to 5.