
Abbreviation extraction method and electronic device

Info

Publication number
CN113705194A
Authority
CN
China
Prior art keywords
type, word, target, text, probability
Legal status
Pending
Application number
CN202110389648.2A
Other languages
Chinese (zh)
Inventor
铁瑞雪 (Tie Ruixue)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110389648.2A
Publication of CN113705194A

Classifications

    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Abstract

The embodiments of the present application relate to the technical field of artificial intelligence, and provide a learning-algorithm-based abbreviation extraction method and an electronic device. After the electronic device obtains the text from which an abbreviation is to be extracted, it parses the type of each word in the text. The types involved in the present technology include at least one target type and at least one non-target type, where a target type is a type related to the abbreviation of a target subject. The electronic device then obtains the abbreviation of the target subject from the words in the text whose type is a target type. This ensures the accuracy of the extracted abbreviations.

Description

Abbreviation extraction method and electronic device
Technical Field
The embodiments of the present application relate to the field of Artificial Intelligence (AI), and in particular to an abbreviation extraction method and an electronic device.
Background
When expressing information, people often refer to subjects such as organizations, institutions, and documents by name. Because the full name of a subject usually contains relatively many characters, for convenience of expression people refer to the subject using relatively few characters associated with it. Hereinafter, the "full name of the subject" is referred to as the full name, and the "characters that are related to the subject and relatively few in number" are referred to as the abbreviation of the subject. Accordingly, when performing information processing services related to subject names, such as crawling and organization matching, it is necessary to extract the abbreviation of a subject so that the subject can be unambiguously associated based on its abbreviation.
A common method for extracting the abbreviation of a subject is to preset a plurality of templates representing the composition structures of full names, together with an abbreviation extraction rule corresponding to each template, and then extract the abbreviation of the subject from its full name according to the extraction rule of the template that matches the full name. For example, the company full-name template "address information + company identification information + company general information" may correspond to the abbreviation extraction rule "use the company identification information as the company abbreviation". In practice, it is difficult for the preset templates to cover all the full-name composition structures required by a service, so the accuracy of the extracted abbreviations is low. In addition, the correspondence between templates and rules needs to be maintained, which increases maintenance cost.
Disclosure of Invention
The embodiments of the present application provide an abbreviation extraction method and an electronic device, and aim to solve the problems of low abbreviation accuracy and high maintenance cost that arise when abbreviations are obtained based on the correspondence between templates and rules.
In a first aspect, an embodiment of the present application provides an abbreviation extraction method, including:
acquiring a first text;
parsing a type of each word in the first text, wherein the type of each word is one of at least one target type and at least one non-target type, and the target type comprises a type related to the abbreviation of a target subject;
and obtaining the abbreviation of the target subject according to the words whose type is a target type in the first text.
In some possible embodiments, parsing the type of each word in the first text includes:
obtaining semantic representation of each word in the first text;
obtaining the initial probability of each word corresponding to each type according to the semantic representation of each word;
and obtaining the final type of the corresponding word according to the initial probability corresponding to each word.
In some possible embodiments, deriving the final type of each word according to the initial probability corresponding to the word includes:
taking the type corresponding to the maximum initial probability of each word as the final type of the corresponding word; or,
and determining the final type of each word according to the initial probability and the type transition probability of each word corresponding to each type, wherein the type transition probability represents the probability of each type combining with any one of at least one target type and at least one non-target type.
In some possible embodiments, determining the final type of each word based on the initial probability and the type transition probability for each type for each word includes:
determining at least one combination probability corresponding to the words in the first text according to the initial probability and the type transition probability of each word for each type, wherein the type combination corresponding to each combination probability is obtained by assigning one type to each word in the first text; for any combination, the combination probability is the sum of a first probability and a second probability, where the first probability is the sum of the initial probabilities of the types corresponding to the words in the type combination, and the second probability is the sum of the type transition probabilities between adjacent types in the combination;
and determining the type combination corresponding to the maximum combination probability in the at least one combination probability as a target type combination, and determining the type corresponding to each word in the target type combination as the final type of each word.
In some possible embodiments, obtaining the abbreviation of the target subject from the words whose type is a target type in the first text includes:
combining the words whose type is a target type according to each of at least one rule to obtain at least one abbreviation of the target subject; or,
selecting at least one target rule from the at least one rule according to the target types corresponding to the recognized words contained in the first text, and combining the words whose type is a target type according to the at least one target rule to obtain at least one abbreviation of the target subject.
In some possible embodiments, the method further comprises:
acquiring a second text set, wherein the second text set comprises at least one full-name text and at least one abbreviation corresponding to the at least one full-name text respectively;
and deriving the at least one rule according to the type, in the corresponding full-name text, of each word contained in each abbreviation.
In some possible embodiments, each of the at least one rule indicates that the elements constituting the abbreviation include a target type that characterizes the identifier of the target subject.
In some possible embodiments, if the target subject is an organization, the at least one target type includes a secondary place name, an organization identifier, and organization industry information, and the at least one non-target type includes a primary place name, organization type information, and organization branch information.
In some possible embodiments, the abbreviation of the target subject includes a word whose type is organization identifier and a word corresponding to at least one of the following target types: the secondary place name type or the organization industry information type.
In a second aspect, an embodiment of the present application provides an abbreviation extraction apparatus, including:
the acquisition module is used for acquiring a first text;
the parsing module, used for parsing the type of each word in the first text, wherein the type of each word is one of at least one target type and at least one non-target type, and the target type comprises a type related to the abbreviation of the target subject;
and the abbreviation extraction module, used for obtaining the abbreviation of the target subject according to the words whose type is a target type in the first text.
In some possible embodiments, the parsing module is further configured to obtain a semantic representation of each word in the first text; obtaining the initial probability of each word corresponding to each type according to the semantic representation of each word;
and obtaining the final type of the corresponding word according to the initial probability corresponding to each word.
In some possible embodiments, the parsing module is further configured to take the type corresponding to the maximum initial probability of each word as the final type of the corresponding word; alternatively, the parsing module is further configured to determine the final type of each word according to the initial probability and the type transition probability of each word for each type, wherein the type transition probability represents the probability of each type combining with any one of the at least one target type and the at least one non-target type.
In some possible embodiments, the parsing module is further configured to determine, according to the initial probability and the type transition probability of each word for each type, at least one combination probability corresponding to the words in the first text, where the type combination corresponding to each combination probability is obtained by assigning one type to each word in the first text; for any combination, the combination probability is the sum of a first probability and a second probability, where the first probability is the sum of the initial probabilities of the types corresponding to the words in the type combination, and the second probability is the sum of the type transition probabilities between adjacent types in the combination;
the analysis module is further configured to determine a type combination corresponding to a maximum combination probability among the at least one combination probability as a target type combination, and determine a type corresponding to each word in the target type combination as a final type of each word.
In some possible embodiments, the abbreviation extraction module is further configured to combine, according to each of at least one rule, the words whose type is a target type, to obtain at least one abbreviation of the target subject;
the abbreviation extraction module is also configured to select at least one target rule from the at least one rule according to the target types corresponding to the recognized words contained in the first text, and to combine the words whose type is a target type according to the at least one target rule to obtain at least one abbreviation of the target subject.
In some possible embodiments, the obtaining module is further configured to obtain a second text set, where the second text set includes at least one full-name text and at least one abbreviation corresponding to the at least one full-name text;
and the parsing module is further configured to derive the at least one rule according to the type, in the corresponding full-name text, of each word contained in each abbreviation.
In some possible embodiments, each of the at least one rule indicates that the elements constituting the abbreviation include a target type that characterizes the identifier of the target subject.
In some possible embodiments, if the target subject is an organization, the at least one target type includes a secondary place name, an organization identification, and organization industry information,
the at least one non-target type includes a primary place name, organization type information, and organization branch information.
In some possible embodiments, the abbreviation of the target subject includes a word whose type is organization identifier and a word corresponding to at least one of the following target types: the secondary place name type or the organization industry information type.
In a third aspect, an electronic device is provided, comprising a memory and one or more processors, wherein the memory is configured to store a computer program; when the computer program is executed by the processor, the electronic device is caused to perform the abbreviation extraction method of the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, the instructions cause the computer to perform part or all of the steps of the abbreviation extraction method described in the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, the present application provides a computer program product, where the computer program product includes computer program code, and when the computer program code runs on a computer, the computer is caused to implement the method in the first aspect or any possible implementation manner of the first aspect.
In an implementation manner of the embodiments of the application, the electronic device parses each word in the full-name text according to preset candidate types, and then obtains at least one abbreviation of the target subject based on the words in the full-name text whose type is a target type. Because the abbreviation is obtained based on the type of each character, the accuracy of the extracted abbreviation can be ensured, full-name texts with different composition structures can be handled, the electronic device does not need to maintain preset abbreviation extraction rules, and maintenance cost can be reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments are briefly described below. It should be understood that a person of ordinary skill in the art may derive other figures from these figures without creative effort.
Fig. 1 is a flowchart of an abbreviation extraction method 10 provided by an embodiment of the present application;
fig. 2 is a schematic data-flow diagram of an abbreviation extraction method 20 provided by an embodiment of the present application;
FIG. 2A is an exemplary operational flow diagram of a Bert model provided by an embodiment of the present application;
FIG. 2B is an exemplary operational flow diagram of a bidirectional Transformer layer provided by an embodiment of the present application;
FIG. 2C is an exemplary operational flow diagram of an LSTM model provided by an embodiment of the present application;
FIG. 3 is an exemplary operational flow diagram of the full-name parsing model of FIG. 2 provided by an embodiment of the present application;
FIG. 4A is a diagram illustrating an exemplary scenario of full-name parsing output provided by an embodiment of the present application;
fig. 4B is a schematic diagram of an exemplary scenario of abbreviation extraction provided by an embodiment of the present application;
fig. 5A is a schematic diagram illustrating an exemplary composition of an abbreviation extraction apparatus 50 provided by an embodiment of the present application;
fig. 5B is a schematic structural diagram of an electronic device 51 provided by an embodiment of the present application.
Detailed Description
The following describes technical solutions of the embodiments of the present application with reference to the drawings in the embodiments of the present application.
The terminology used in the following embodiments of the present application is for the purpose of describing particular embodiments and is not intended to limit the technical solutions of the present application. As used in the specification of the present application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that although the terms first, second, etc. may be used in the following embodiments to describe a class of objects, the objects are not limited by these terms; the terms are only used to distinguish particular objects within that class. For example, the following embodiments use the terms first, second, etc. to describe probabilities, but the probabilities are not limited by these terms, which merely distinguish probabilities in different implementations. The following embodiments may likewise use the terms first, second, etc. to describe other classes of objects, which is not repeated below.
The embodiments of the present application relate to a technique for extracting the abbreviation of a subject from the full name of the subject based on Artificial Intelligence (AI). According to the technical solution of the embodiments, after obtaining the full-name text, the electronic device parses the type of each word in the text and then obtains the abbreviation of the subject according to the words corresponding to the target types. Because the type of each word in the text is obtained through a learning algorithm and the abbreviation of the subject is derived from those types, the accuracy of the extracted abbreviation can be ensured, no extraction rules need to be maintained, and maintenance cost can be reduced.
AI is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
AI technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated artificial-intelligence chips, cloud computing, distributed storage, big-data processing, operating/interactive systems, mechatronics, and the like. AI software technology mainly includes Computer Vision (CV), Speech Technology, Natural Language Processing (NLP), and machine learning/deep learning.
A learning algorithm is a training method that uses a plurality of learning data to train a predetermined target device (e.g., a robot) so as to trigger, enable, or control the target device to make determinations or predictions. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
The technical solution mainly relates to NLP technology, which is an important direction in the fields of computer science and artificial intelligence. NLP studies theories and methods that enable effective communication between humans and computers in natural language. It is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. NLP techniques typically include text processing, semantic understanding, machine translation, question answering, and the like.
The embodiments of the present application provide an abbreviation extraction method, which can be executed by one electronic device or by a computer cluster. The computer cluster comprises at least two electronic devices that support the abbreviation extraction method of the embodiments of the present application, and any one of the electronic devices can implement the abbreviation extraction function described herein by deploying the algorithm models.
Optionally, in a scenario where the abbreviation extraction method is executed by a computer cluster, the computer cluster may be implemented as a blockchain, with each electronic device in the cluster serving as a node of the blockchain. That is, in this scenario, the abbreviation extraction method of the embodiments of the present application is executed by at least one node on the blockchain.
Any electronic device related to the embodiments of the present application may be an electronic device such as a mobile phone, a tablet computer, a wearable device (e.g., a smart watch, a smart bracelet, etc.), a notebook computer, a desktop computer, and an in-vehicle device. It is understood that the embodiment of the present application does not set any limit to the specific type of the electronic device.
The "subject" referred to in the embodiments of the present application may be an organization, a mechanism, a document, and the like. Organizations may include corporate enterprises, institutions, communities, and the like. Institutions may include government agencies and research institutions, among others. The documents may include official documents, treatises, and the like. Taking a body implementation as a company as an example, the complete name (also called totally) of a company is "Shenzhen TX computer systems limited", and the acronyms of the company include "TX" and "TX technology", for example.
It should be noted that the text referred to in the embodiments of the present application may include at least two words arranged in a certain order to form a sequence. A word described in the embodiments of the present application may be a Chinese character or an English word, such as "cloud", "teng", "calculating", or "tencent". For ease of description, full-name text composed of Chinese characters is taken as an example. Although the illustrative examples in this specification are texts composed of Chinese characters, these texts do not correspond to any actual organization; for convenience of description, some of the content in the examples is expressed using English letters. In practical implementations, these English letters would be replaced by actual Chinese characters.
It is understood that the embodiments of the present application can also be applied to future techniques for extracting target characters from text. The learning algorithms and service scenarios described in this application are intended to illustrate the technical solution more clearly and do not limit it; those skilled in the art will appreciate that, with the evolution of learning algorithms and the emergence of new service scenarios, the technical solution provided in this application is equally applicable to similar technical problems.
The technical solutions of the embodiments of the present application and the technical effects produced by the technical solutions of the present application will be described below through descriptions of several exemplary embodiments.
As shown in fig. 1, an embodiment of the present application provides an abbreviation extraction method 10 (hereinafter referred to as method 10), and the method 10 includes the following steps:
Step S101: acquiring a first text.
Wherein the first text comprises a full name of the target subject to be recognized. The target subject to be identified may be any of the subjects mentioned above, such as the full name of a company. In some embodiments, the first text is implemented as a full name of the target subject. In other embodiments, the first text may include a full name of the at least one subject, the full name of the at least one subject including a full name of the target subject.
It should be noted that, in one alternative implementation of the scenario where the first text contains the full names of at least one subject, the subjects may have different attributes; for example, they include at least one company and at least one institution, such as "Shenzhen TX Computer Systems Limited" and the "Chinese IP Office". In another alternative implementation, the subjects may be different subjects with the same attribute; for example, each is a company, such as "Shenzhen TX Computer Systems Limited" and "Guangzhou WY Computer Systems Limited".
The first text may be full-name text input by the user and received by the electronic device, or full-name text pre-stored on the electronic device. The electronic device can receive a first text typed by the user, or receive audio input by the user and convert the audio into text to obtain the first text. In addition, the electronic device can respond to an operation instruction of the user and download the first text from another device over the network. The specific obtaining manner may be determined by the requirements of the actual application scenario and is not limited here.
Step S102: parsing the type of each word in the first text, wherein the type of each word is one of at least one target type and at least one non-target type.
Wherein the target type includes a type related to an abbreviation of a target subject. "type related to an abbreviation of target subject" refers to a type that can be used to constitute the abbreviation of target subject. Accordingly, the non-target type refers to a type that does not constitute a short name of a target subject. Both target types and non-target types may be preconfigured by the relevant skilled person.
For example, taking the target subject as an organization, the relevant technician may preconfigure the following types: primary place name, secondary place name, organization identifier, organization industry information, organization type information, and organization branch information. The organization identifier is used to distinguish the organization corresponding to the target subject from other organizations of the same type. Organization industry information refers to information in the first text that characterizes the industry in which the organization is engaged; for example, "media" in the text "media company A" indicates that the organization is engaged in the media industry, so "media" is the organization industry information in that text. As another example, "Internet" in the text "B Internet Inc" indicates that the organization is engaged in the Internet industry, so "Internet" is the organization industry information in that text. Organization type information refers to information characterizing the form of the organization in the first text, for example, "Company Limited" in the text "AA Company Limited" or in the text "B Company Limited". Organization branch information is information representing branches, such as branch offices and divisions, of the organization in the first text, for example, "Hebei branch" in the text "Hebei branch of Bank A" or "Hainan branch" in the text "Hainan branch of Company B". In this example, the at least one target type may include the secondary place name, the organization identifier, and the organization industry information, and the at least one non-target type includes the primary place name, the organization type information, and the organization branch information. That is, words whose type is secondary place name, organization identifier, or organization industry information are the words related to the abbreviation of the target subject. A minimal sketch of this type inventory is given below.
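To make the type inventory concrete, the following Python sketch records the target and non-target type sets for the organization scenario; the label names are illustrative assumptions, not identifiers prescribed by the patent.

```python
# Hypothetical label names for the six types described above.
TARGET_TYPES = {
    "SECONDARY_PLACE",  # secondary place name
    "ORG_ID",           # organization identifier, e.g. "TX"
    "ORG_INDUSTRY",     # organization industry information, e.g. "media"
}

NON_TARGET_TYPES = {
    "PRIMARY_PLACE",    # primary place name
    "ORG_TYPE",         # organization type information, e.g. "Company Limited"
    "ORG_BRANCH",       # organization branch information, e.g. "Hebei branch"
}

ALL_TYPES = TARGET_TYPES | NON_TARGET_TYPES
```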
It will be understood that the above examples are illustrative only and are not to be construed as limiting the embodiments of the present application. In some other embodiments, if the target subject is other, for example, a file, the specific implementation manner of the type, the target type, and the non-target type related to the embodiment of the present application may be content matched with a scene corresponding to the file. The embodiments of the present application will not be described in detail herein.
It should be noted that, after the candidate types are defined, the electronic device may use a corpus of full names and their abbreviations to train, based on a learning algorithm, a model that parses the type of each word. In the embodiments of the present application this model is called the full-name parsing model. Related embodiments of the full-name parsing model are described in detail below.
Further, after obtaining the first text, the electronic device may use the aforementioned full-name parsing model to parse the final type of each word in the first text. The electronic device may obtain semantic representations of each word in the first text, and then obtain an initial probability of each word corresponding to each type based on the semantic representations of each word to obtain a final type of the corresponding word according to the initial probability of each word.
The semantic representation of a word refers to features of the word expressed in multiple dimensions, such as the word itself and the relation between the word and the semantics of the text. Based on this, the electronic device can calculate the initial probability of each word in the first text for each candidate type according to the semantic representation of the word, thereby quantifying the possibility that each word belongs to each candidate type; the electronic device can then determine the final type of each word according to this quantified data (i.e., the initial probabilities).
In some embodiments, the aforementioned initial probability may be implemented as a probability that each word is of each candidate type, i.e., a probability distribution of each word in the first text for each candidate type. In this example, for each of the aforementioned words, the electronic device may determine the candidate type corresponding to the maximum value of the initial probabilities corresponding to the word as the final type of the word.
For example, the "department" word may correspond to the type-level place name with an initial probability of 0.05, the type-organization identifier with an initial probability of 0.15, the type-organization industry information with an initial probability of 0.45, the type-organization type information with an initial probability of 0.2, and the type-organization branch information with an initial probability of 0.1.
In other embodiments, to ensure the accuracy of the resulting final type of each word, the electronic device determines the final type of each word based on the initial probability and the type transition probability of each word for each type. The type transition probability characterizes the probability of each type combining with any one of the at least one target type and the at least one non-target type; it is obtained by training on a large corpus and represents the relevance between words. Based on this, after obtaining the initial probabilities, the electronic device may enumerate the type combinations that arise when each word takes each of its candidate types, thereby obtaining the combinations of each word's types with the types of the other words. For each type combination, the electronic device computes a combination probability, which is the sum of a first probability and a second probability: the first probability is the sum of the initial probabilities of the types corresponding to the words in the combination, and the second probability is the sum of the type transition probabilities between adjacent types in the combination. The electronic device may then determine the type corresponding to each word in the combination with the maximum combination probability as the final type of each word.
The following uses the first text "ABC", in which each word corresponds to three candidate types, as an example. The initial probabilities of "A" for the three candidate types are P_A1, P_A2, and P_A3; those of "B" are P_B1, P_B2, and P_B3; and those of "C" are P_C1, P_C2, and P_C3. The type transition probabilities may, for example, form a state transition matrix with entries such as P(A1, A1), P(A1, A2), P(A1, A3), P(A1, B1), P(A1, B2), ..., P(C3, C3). Taking P(A1, B1) as an example, P(A1, B1) represents the transition probability for the combination of type A1 with type B1. Based on the candidate types of "A", "B", and "C", the electronic device may derive, for example, the type combinations A1B1C1, A1B1C2, A1B2C1, A2B1C1, and so on; illustratively, the combinations are obtained by permuting the three candidate types of each word, which are not listed one by one here. In the type combination A1B1C2, the word "A" corresponds to candidate type A1, the word "B" corresponds to candidate type B1, and the word "C" corresponds to candidate type C2. The combination probability of this type combination can be expressed as: P_A1B1C2 = P_A1 + P_B1 + P_C2 + P(A1, B1) + P(B1, C2), where P_A1 is the initial probability that the word "A" corresponds to candidate type A1, P_B1 is the initial probability that the word "B" corresponds to candidate type B1, P_C2 is the initial probability that the word "C" corresponds to candidate type C2, P(A1, B1) is the type transition probability of candidate type A1 combined with candidate type B1, and P(B1, C2) is the type transition probability of candidate type B1 combined with candidate type C2. The other type combinations and their combination probabilities are obtained in the same way and are not repeated here. The electronic device then takes the type of each word in the type combination with the maximum combination probability as the final type of each word. For example, if the combination probability of A1B1C2 is the greatest, the final type of the word "A" is A1, that of "B" is B1, and that of "C" is C2. A brute-force sketch of this scoring follows.
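The following sketch reproduces this enumeration literally for a short text, assuming dictionaries of initial and transition probabilities as inputs. A production system would use Viterbi dynamic programming (as in a CRF) rather than exhaustive enumeration, which grows exponentially with text length.

```python
from itertools import product

def best_type_combination(words, init, trans, types):
    """Score every type combination as (sum of initial probabilities of the
    assigned types) + (sum of transition probabilities between adjacent
    types), as in the "ABC" example above, and return the maximum."""
    best_combo, best_score = None, float("-inf")
    for combo in product(types, repeat=len(words)):
        first = sum(init[w][t] for w, t in zip(words, combo))       # P_A1 + ...
        second = sum(trans[(a, b)] for a, b in zip(combo, combo[1:]))  # P(A1,B1) + ...
        if first + second > best_score:
            best_combo, best_score = combo, first + second
    return best_combo
```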
Step S103: obtaining the abbreviation of the target subject according to the words whose type is a target type in the first text.

The electronic device can obtain at least one abbreviation of the target subject from the words in the first text whose type is a target type.
For example, after obtaining the characters of the target types, the electronic device may combine adjacent characters of the same type into words according to the position of each character in the first text. The words whose type is a target type may then be used to obtain at least one abbreviation of the target subject. The electronic device may combine characters into words based on, for example, Named Entity Recognition (NER) technology; related embodiments are described in detail below.
In practical implementations, the skilled person can derive at least one rule for obtaining abbreviations through a learning algorithm based on an existing corpus. For example, the electronic device may obtain a second text set that includes at least one full-name text and at least one abbreviation corresponding to each full-name text. The electronic device then derives at least one rule according to the types, in the corresponding full-name text, of the words contained in each abbreviation.
An example of the second text set is shown in Table 1.
TABLE 1

Full-name text | Abbreviations
Beijing XYZ Technology Limited | XYZ; XYZ Technology; Beijing XYZ
Wuhan MN Information Technology Limited | MN; MN Information Technology
(remaining rows follow the same pattern)
As shown in Table 1, the organization full-name text "Beijing XYZ Technology Limited" corresponds to three abbreviations: "XYZ", "XYZ Technology", and "Beijing XYZ". The organization full-name text "Wuhan MN Information Technology Limited" corresponds to two abbreviations: "MN" and "MN Information Technology". The relationships between the entries in the other rows of Table 1 are analogous and are not repeated here.
Based on the corpus in Table 1, the electronic device may obtain the type of each word of each abbreviation in the corresponding full-name text, and from these derive the combination form of types corresponding to each abbreviation. For example, the abbreviation "XYZ Technology" contains the word "XYZ" and the word "Technology". After obtaining the types of "X", "Y", and "Z", the electronic device may combine them into the word "XYZ" and determine that its type in the full-name text "Beijing XYZ Technology Limited" is organization identifier; similarly, it may determine that the type of the word "Technology" in that full-name text is organization industry information. The electronic device can thus derive the combination form of "XYZ Technology": a word of type organization identifier combined with a word of type organization industry information. The composition of the other abbreviations in Table 1 is derived in the same way and is not described one by one here. Based on this, the electronic device may take the combination forms whose total counts rank in the top three as the rules for obtaining an abbreviation according to type; a counting sketch is given below.
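A counting sketch of this rule derivation, under the assumption that the type of each abbreviation character in its full-name text is already known (e.g., from the full-name parsing model) and that each character occurs once per abbreviation:

```python
from collections import Counter

def mine_rules(samples, top_k=3):
    """samples: (abbreviation, char_type) pairs, where char_type maps each
    character of the abbreviation to its type in the full-name text.
    Adjacent characters of the same type are merged into one element of the
    combination form; the top_k most frequent forms become rules."""
    counts = Counter()
    for abbr, char_type in samples:
        form = []
        for ch in abbr:
            t = char_type[ch]
            if not form or form[-1] != t:   # merge, e.g. X/Y/Z -> one ORG_ID
                form.append(t)
        counts[tuple(form)] += 1
    return [form for form, _ in counts.most_common(top_k)]

# e.g. for "XYZ Technology": X, Y, Z are ORG_ID and "Technology" is
# ORG_INDUSTRY, so its combination form is ("ORG_ID", "ORG_INDUSTRY").
```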
For brevity, the combination of a word whose type is organization identifier with a word whose type is organization industry information is written hereinafter in the form "organization identifier + organization industry information". Here "+" means that a word of the type before the "+" is combined with a word of the type after the "+". In addition, "+" also indicates the positional relationship of the combined words; for example, "A + B" means that the combined text sequence is "AB" rather than "BA". The type notation that appears hereinafter in this specification is used in this sense, which is not repeated below.
For example, one possible design of the present application may include the following three rules. The first rule: organization identifier. The second rule: organization identifier + organization industry information. The third rule: secondary place name + organization identifier + organization industry information.
It is to be understood that the above rules are merely illustrative and are not to be construed as limiting the embodiments of the present application. In other embodiments, the abbreviation derivation rules may be different, or there may be more or fewer rules. The embodiments of the present application do not limit this.
Further, in some embodiments, the electronic device may combine the words whose type is a target type according to each of the at least one rule described above, to obtain at least one abbreviation of the target subject.
It should be noted that the at least one rule is derived from a corpus containing a large number of words of the target types involved in the rules. In practice, the full-name text of the target subject does not necessarily contain words of all the target types involved in a rule. For example, if the full-name text of the target subject is "CC chemical industry group", the text contains the word "CC" corresponding to the target type organization identifier and the words "chemical" and "industry" corresponding to organization industry information, but contains no word corresponding to the target type secondary place name. Based on this, the electronic device may also obtain the abbreviation according to only the rules that match the first text.
Correspondingly, in other embodiments, the electronic device may select at least one target rule from the at least one rule according to the target types corresponding to the recognized words contained in the first text, and then combine the words whose type is a target type according to the at least one target rule to obtain at least one abbreviation of the target subject.
For example, after obtaining the word "CC" corresponding to the organization identifier and the words "chemical" and "industry" corresponding to organization industry information, the electronic device may select, from the three rules above, the rules that involve the target types organization identifier and organization industry information but not the target type secondary place name, namely the first rule (organization identifier) and the second rule (organization identifier + organization industry information). The electronic device can then obtain the abbreviations of "CC chemical industry group" according to the first and second rules respectively, yielding the abbreviations "CC" and "CC chemical industry". A sketch of this rule matching follows.
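A sketch of the matching step, assuming the three rules above and the type labels from the earlier sketches; only the rules whose every type occurs in the parsed text are applied.

```python
RULES = [
    ("ORG_ID",),                                   # first rule
    ("ORG_ID", "ORG_INDUSTRY"),                    # second rule
    ("SECONDARY_PLACE", "ORG_ID", "ORG_INDUSTRY"), # third rule
]

def extract_abbreviations(typed_words, rules=RULES):
    """typed_words: (word, type) pairs in text order, e.g. the merged
    output of the full-name parsing model."""
    by_type = {}
    for word, word_type in typed_words:
        by_type.setdefault(word_type, []).append(word)
    abbrs = []
    for rule in rules:
        if set(rule) <= set(by_type):   # every type the rule needs is present
            abbrs.append("".join("".join(by_type[t]) for t in rule))
    return abbrs

# "CC chemical industry group": no secondary place name, so only the first
# two rules fire. Words are joined without separators, mirroring Chinese
# text, so this prints ['CC', 'CCchemical industry'].
print(extract_abbreviations(
    [("CC", "ORG_ID"), ("chemical industry", "ORG_INDUSTRY"),
     ("group", "ORG_TYPE")]))
```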
In a possible implementation manner, the function of the at least one rule may be implemented by an abbreviation extraction model. Related embodiments of the abbreviation extraction model are described below.
By adopting this implementation, abbreviations are obtained through different combination rules, which can increase the number of abbreviations obtained.
As can be seen from the foregoing examples, the identifier of the target subject is usually a distinctive identifier that distinguishes the target subject from other similar subjects. Therefore, in an optional design, each of the foregoing at least one rule may indicate that the elements constituting the abbreviation include the target type that characterizes the identifier of the target subject. For example, if the target subject is an organization, the abbreviation of the target subject includes a word whose type is organization identifier and a word corresponding to at least one of the following target types: the secondary place name type or the organization industry information type.
It should be understood that the above full-name and abbreviation examples are illustrative only and are not to be construed as limiting the embodiments of the present application. In other embodiments, the first text may be the full-name text of other subjects, and the corresponding abbreviations may change as the implementation scenario changes. The embodiments of the present application do not limit this.
Therefore, by adopting this implementation, the electronic device uses a learning algorithm to parse each word in the full-name text according to the preset candidate types, and then obtains at least one abbreviation of the target subject based on the words in the full-name text whose type is a target type. Because the abbreviation is obtained based on the type of each character, the accuracy of the extracted abbreviation can be ensured, full-name texts with different composition structures can be handled, and the electronic device does not need to maintain preset abbreviation extraction rules, which reduces maintenance cost.
The method 10 describes the abbreviation extraction method of the embodiments of the present application from the perspective of the electronic device; the following describes the abbreviation extraction method with reference to the learning algorithms involved in the embodiments of the present application.
Illustratively, as shown in fig. 2, the embodiments of the present application may involve a data preprocessing module, a full-name parsing model, and an abbreviation extraction model. The data preprocessing module is used to filter the input text sequence to obtain the first text. The full-name parsing model is used to parse out the candidate type of each word in the first text based on a pre-trained learning algorithm. The abbreviation extraction model obtains the abbreviation by combining the words whose type is a target type, based on the word types produced by the full-name parsing model.
According to the description of the foregoing embodiments, the full-name parsing model is mainly used to realize three functions: obtaining the semantic representation of each word in the first text, obtaining the probability of each word for each type, and determining the type of each word according to the probabilities. In one possible implementation, the full-name parsing model may employ neural-network mechanisms to implement these three functions.
For example, models for obtaining the semantic representation of each word in the first text may include: the Bidirectional Encoder Representations from Transformers (Bert) model, the lightweight version of Bert (A Lite Bert, ALBERT) model, neural-network structures based on the attention mechanism, NLP models based on self-attention, and the like. Models for obtaining the probability of each word for each type may include: the Long Short-Term Memory (LSTM) model, the Bi-directional Long Short-Term Memory (BiLSTM) model combining a forward LSTM and a backward LSTM, the Recurrent Neural Network (RNN), the Gated Recurrent Unit (GRU) model, and other neural network models. Models for determining the type of each word from the probabilities may include: Conditional Random Field (CRF) models, and the like.
It should be noted that the above models can be combined as required to obtain the full-name parsing model of the embodiments of the present application. In some embodiments, the full-name parsing model may be implemented as a combination of a Bert model, an LSTM model, and a CRF model. In other embodiments, the full-name parsing model may be implemented as a combination of a Bert model and a CRF model. In still other embodiments, the full-name parsing model may be implemented as a combination of a BiLSTM model and a CRF model. A skeleton of one such combination is sketched below.
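A structural skeleton of the first combination, written in PyTorch as an assumption about one possible layout (the patent does not fix layer sizes, and a plain embedding stands in for the Bert encoder). The LSTM supplies per-type emission scores, i.e. the "initial probabilities", and the transition matrix plays the role of the CRF's type transition probabilities.

```python
import torch
import torch.nn as nn

class FullNameParser(nn.Module):
    def __init__(self, vocab_size, num_types, hidden=256):
        super().__init__()
        # Stand-in for the Bert semantic-representation layer.
        self.embed = nn.Embedding(vocab_size, hidden)
        # BiLSTM producing one feature vector per character.
        self.lstm = nn.LSTM(hidden, hidden // 2,
                            batch_first=True, bidirectional=True)
        # Emission scores: one score per candidate type per character.
        self.emit = nn.Linear(hidden, num_types)
        # CRF-style type transition scores between adjacent characters.
        self.transitions = nn.Parameter(torch.randn(num_types, num_types))

    def forward(self, token_ids):
        feats, _ = self.lstm(self.embed(token_ids))
        return self.emit(feats)  # shape: (batch, seq_len, num_types)
```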
In some embodiments, parsing the type of each word in the first text may be accomplished by having the relevant model perform an NER operation on each word. NER, also called proper-name recognition, refers to recognizing entities with specific meanings in text, such as person names, place names, organizations, and documents, as well as words representing time, quantity, currency, and ratio values. A named entity represents an individual item carrying a label, such as an identifier or a place name. For example, the named-entity recognition result for the full-name text "Beijing City PQ Group" may include "Beijing City (place-name entity)", "PQ (organization-identifier entity)", and "Group (organization-type-information entity)".
In practical implementations, NER may label and identify the word sequence in the first text by means of Begin-Intermediate-End (BIE) tags and type tags. The BIE tag is a segmentation tag: B, Begin, indicates the beginning character of a type tag; I, Intermediate, indicates a middle character of a type tag; E, End, indicates the ending character of a type tag. The type tag indicates each of the types defined by those skilled in the art, which are described in the above embodiments and not repeated here.
For example, the label of each character in the full-name text "Beijing City PQ Group" may be expressed as: North (B-CITY), Jing (I-CITY), City (E-CITY), P (B-ID), Q (E-ID), set (B-TYP), and group (E-TYP). A decoding sketch for such tags follows.
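A decoding sketch that groups BIE tags back into typed entities; single-character entities and malformed tag sequences are ignored for simplicity.

```python
def decode_bie(chars, tags):
    """E.g. chars = ["North", "Jing", "City", "P", "Q", "set", "group"],
    tags = ["B-CITY", "I-CITY", "E-CITY", "B-ID", "E-ID", "B-TYP", "E-TYP"]
    yields [("NorthJingCity", "CITY"), ("PQ", "ID"), ("setgroup", "TYP")]."""
    entities, buf, current = [], [], None
    for ch, tag in zip(chars, tags):
        position, _, entity_type = tag.partition("-")
        if position == "B":                       # beginning character
            buf, current = [ch], entity_type
        elif position in ("I", "E") and entity_type == current:
            buf.append(ch)                        # middle or ending character
            if position == "E":                   # flush the finished entity
                entities.append(("".join(buf), current))
                buf, current = [], None
    return entities
```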
The following takes the Bert model, the LSTM model, and the CRF model as examples to illustrate the processing involved in the full-name parsing model.
Referring to fig. 2A, fig. 2A is an exemplary operation flow diagram of a Bert model provided in an embodiment of the present application. The Bert model can be used for dividing words and adding labels to an input text to obtain a word vector and a sentence vector of the text, and then semantic representation of each word in the text is obtained based on the word vector and the sentence vector.
Referring again to fig. 2A, the input of the Bert model is, for example, the text "my dog is cute. he likes playing". The Bert model segments this text, removes the punctuation mark ".", and obtains the words "my", "dog", "is", "cute", "he", "likes", and "playing". Furthermore, the Bert model adds the labels [CLS] and [SEP] to the text and produces token embeddings, segment embeddings, and position embeddings to represent the word vector, sentence vector, and position vector of each word in the text.
The label [CLS] is used to indicate the first character of the text. The label [SEP] is used to indicate the end of a sentence; for example, the text "my dog is cute. he likes playing" contains the sentence "my dog is cute" and the sentence "he likes playing", so a [SEP] label is added after "my dog is cute" and another after "he likes playing". The token embeddings identify each word. The segment embeddings identify each sentence; for example, an identifier "A" is added to each word of the sentence "my dog is cute" and an identifier "B" to each word of the sentence "he likes playing". The position embeddings identify the position of each word in the text; for example, the position of the word "my" in the text is "1" and the position of the word "cute" is "4".
Further, the Bert model obtains semantic representations of each word in the text based on the aforementioned respective labels.
Optionally, the Bert model may include a 12-layer bidirectional Transformer structure, which obtains the semantic representation of each word in the text according to the word vectors, position vectors, and sentence vectors. An illustrative usage sketch follows.
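As an illustration only, per-token semantic representations of the kind described here can be obtained from a pretrained Bert checkpoint via the HuggingFace transformers library; neither the library nor the checkpoint is prescribed by the patent.

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

# The anonymized example full name from earlier in the text.
inputs = tokenizer("Shenzhen TX Computer Systems Limited", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional vector per token, including [CLS] and [SEP].
semantic_repr = outputs.last_hidden_state
```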
Referring to fig. 2B, fig. 2B is an exemplary operational flow diagram of a bidirectional Transformer layer according to an embodiment of the present application. The main functional units of the bidirectional Transformer layer include a Multi-Head Attention layer, a Feed-Forward Neural Network layer, a linear layer, and a logistic-regression (softmax) layer. The Multi-Head Attention layer encodes the input vectors to obtain an encoding of the syntactic and semantic relations between each word and every other word in the text. The Feed-Forward Neural Network layer performs feature fusion on the input feature vectors. The linear layer linearly transforms the input feature vectors so that each feature vector corresponds to a word. The softmax layer converts the feature vectors into semantic representations of the words.
Referring to fig. 2B, the Bert model performs input embedding on the aforementioned word vector (token embeddings), position vector (position embeddings), and sentence vector (segment embeddings), and the bidirectional Transformer layer performs position encoding, i.e., adds the token embeddings and the position embeddings. Then, the encoded vector and the sentence vector are input to the Multi-Head Attention layer. The Multi-Head Attention layer computes the vector of each word against the vectors of the other words to obtain the syntax and semantic relationship codes of any word with the other words in the text, obtains a multi-dimensional context semantic vector for each word, and then inputs the semantic vectors into a Feed Forward layer. The Feed Forward layer performs linear feature fusion on the input vectors.
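The attention computation just described can be pictured with a minimal single-head sketch; multi-head attention runs several such heads in parallel on separately projected queries, keys, and values. All shapes below are illustrative assumptions.

import torch
import torch.nn.functional as F

def attention_head(q, k, v):
    # Each word's query is scored against every other word's key, so the output
    # row for a word mixes in the values of all words it attends to; this is the
    # "syntax and semantic relationship" encoding described above.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# One head over 10 words with 64-dim projections; multi-head attention would run
# several such heads in parallel and concatenate their outputs.
q = k = v = torch.randn(1, 10, 64)
out = attention_head(q, k, v)
print(out.shape)  # torch.Size([1, 10, 64])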
The bidirectional transformer obtains the semantic representation of each word in two directions, left to right and right to left. A mask technique is typically employed: portions of the words are randomly masked, and the masked words are then predicted based on the words that are not masked. The process described so far is the feature extraction process for the unmasked words.
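A minimal sketch of the masking step, assuming a 15% mask rate (a common choice, not a value stated in this application):

import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    # Randomly hide a fraction of the words; training then asks the model to
    # predict each hidden word from the words that remain visible.
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)    # the model must recover this word
        else:
            masked.append(tok)
            targets.append(None)   # unmasked: no prediction needed here
    return masked, targets

print(mask_tokens(["my", "dog", "is", "cute", "he", "likes", "play"]))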
Referring to fig. 2B again, the bidirectional transformer further encodes the masked words through a masked Multi-Head Attention layer, and then encodes the features of the unmasked words and the masked words through the next Multi-Head Attention layer to obtain the semantic vectors after all the words are encoded. Then, the semantic vectors of all the encoded words are feature-fused through the next Feed Forward layer, and the fused feature vectors are input into a linear layer. The linear layer performs a linear transformation on the input vectors to obtain logit vectors corresponding to the number of words in the text, and the logit vectors are input into the softmax layer. The softmax layer converts the logit vector of each word into the semantic representation of that word.
Before outputting, the bidirectional transformer model also performs output embedding on the data to be output, i.e., a Shifted Right operation, which adds a start character and an end character to the data to be output so that the first and last characters of the output data can be predicted.
It should be noted that, in the bidirectional transformer model, each Multi-Head Attention layer, each Feed Forward layer, and the masked Multi-Head Attention layer perform feature fusion (Add & Norm) of the processed data with the input data of the corresponding layer to obtain the data input to the next layer.
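The Add & Norm fusion amounts to a residual connection followed by layer normalization; a minimal sketch under assumed dimensions:

import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    # Residual connection followed by layer normalization: the sub-layer's
    # output is fused with that same sub-layer's input.
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, sublayer_out):
        return self.norm(x + sublayer_out)

fuse = AddAndNorm(768)
x = torch.randn(1, 10, 768)  # input to a sub-layer
y = torch.randn(1, 10, 768)  # output of that sub-layer
print(fuse(x, y).shape)      # torch.Size([1, 10, 768]), fed to the next layer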
It can be understood that the above is only schematically described with an English text. In practical implementation, the input text may be a Chinese text; correspondingly, a plurality of words are obtained after segmenting the Chinese text, and the Bert model then obtains the semantic representation of each word. The specific implementation process is not repeated here.
Referring to fig. 2C, fig. 2C is an exemplary operation flow diagram of the LSTM model provided in the embodiment of the present application. The LSTM model can further extract features from the semantic representations of the words output by the Bert model.
The LSTM model updates the "cell state" through three "gates" to obtain the features of a word. The cell state (C) records all updated context features up to the present moment and can be understood as a long-term memory, i.e., long-term history information. As shown in FIG. 2C, C_{t-1} is the cell state at the previous moment and C_t is the cell state output at the current moment. Gate structures are the structures used in the LSTM model to remove information from or add information to the cell state; they provide a means for the selective passage of information. As shown in fig. 2C, the LSTM model includes three gates: a forgetting gate, an input gate (also called a memory gate), and an output gate. The forgetting gate decides what information is discarded from the cell state, the input gate decides what new information is stored in the cell state, and the output gate decides what the hidden state h_t at the current moment should be.
As shown in FIG. 2C, the input information of the forgetting gate is the hidden state h_{t-1} at the previous moment and the input feature x_t at the current moment. The forgetting gate outputs a number between "0" and "1" for the cell state C_{t-1}, where "0" indicates that the corresponding information is discarded and "1" indicates that the corresponding information is retained. The forgetting gate then outputs the value f_t.
The input information of the input gate is also the hidden state h_{t-1} at the previous moment and the input feature x_t at the current moment. The input gate determines how much new information is added to the previous state matrix; the tanh layer generates a candidate value vector based on h_{t-1} and x_t, and finally the new-information weight is multiplied by the candidate value vector and output.
The input information of the output gate is the hidden state h_{t-1} at the previous moment, the input feature x_t at the current moment, and the currently output cell state C_t. The output gate decides which part of C_t is output and multiplies it by a weight; the value is then normalized through a tanh layer to obtain the hidden state h_t.
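Putting the three gates together, one LSTM time step can be sketched as follows; the parameter layout (four stacked transforms in W, U, b) and all sizes are illustrative assumptions.

import torch

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    # One LSTM time step. W, U, b stack the parameters of the forget (f),
    # input (i), candidate (g) and output (o) transforms.
    z = x_t @ W + h_prev @ U + b  # (batch, 4 * hidden)
    f, i, g, o = z.chunk(4, dim=-1)
    f = torch.sigmoid(f)       # forgetting gate: 0 discards, 1 retains C_{t-1} info
    i = torch.sigmoid(i)       # input gate: how much new information to store
    g = torch.tanh(g)          # candidate value vector from the tanh layer
    o = torch.sigmoid(o)       # output gate: which part of C_t to expose
    c_t = f * c_prev + i * g   # updated cell state (long-term memory)
    h_t = o * torch.tanh(c_t)  # hidden state at the current moment
    return h_t, c_t

batch, in_dim, hidden = 1, 4, 8
W = torch.randn(in_dim, 4 * hidden)
U = torch.randn(hidden, 4 * hidden)
b = torch.zeros(4 * hidden)
h, c = lstm_cell_step(torch.randn(batch, in_dim),
                      torch.zeros(batch, hidden), torch.zeros(batch, hidden), W, U, b)
print(h.shape, c.shape)  # torch.Size([1, 8]) torch.Size([1, 8])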
As can be seen, the LSTM model performs feature extraction according to a sequence order (e.g., a time order or a position order), so in this example, the LSTM model can extract semantic features of words layer by layer based on a position relationship of each word in a text, and further obtain a probability distribution that each word corresponds to a different tag.
A CRF model is a conditional probability model that, given one set of variables, solves for the distribution of another set of variables. Optionally, the CRF model may obtain a preset transition probability for each pair of types to be combined through a learning algorithm. Corresponding combination probabilities for different type combinations are then obtained based on the transition probabilities, and the probability distribution output by the LSTM model is constrained based on these combination probabilities.
Referring again to fig. 2, the present embodiment provides an extraction method 20 (hereinafter referred to as method 20). The method 20 is described taking a company as the target subject. In this example, the types corresponding to a company may be set as: primary place name, secondary place name, company identification, company industry, company general information, and branch information. The primary place name indicates the country of the company, e.g., China; its type tag is, for example, LOC. The secondary place name indicates the province, city, county, town, or similar locality of the company, such as Beijing City; its type tag is, for example, CIT. The company identification indicates the symbolic name of the company, e.g., TX; its type tag is, for example, COM. The company industry indicates the company's industry attribute, such as the internet; its type tag is, for example, PRO. Company general information is the generic wording of company names, such as "limited company" or "stock limited company"; its type tag is, for example, MON. The branch information includes, for example, branches, lines, and sections; its type tag is, for example, PAR.
The method 20 comprises the steps of:
Step S201, the data preprocessing module receives a third sample set.
In this example, the third sample set includes, for example, full-name texts of a plurality of companies and initial acronyms corresponding to each full-name text in the plurality of full-name texts. The initial abbreviation can be extracted according to a rule defined based on a full-name template.
Step S202, the data preprocessing module filters the content in the third sample set to obtain a first text, and the first text is input into the full-name parsing model.
It should be noted that the initial abbreviation in the third sample set is obtained according to a preset fixed rule, so that the initial abbreviation generally contains noise information. Based on this, in this example, the data pre-processing module filters the content in the third sample set.
Illustratively, the data pre-processing module may filter the content in the third sample set by performing at least one of a deduplication, cleaning, or denoising operation on the third sample set. "Deduplication" refers to removing duplicate abbreviations in the third sample set. "Cleaning" refers to removing an abbreviation that contains an illegal character or whose length does not meet a preset condition (for example, the abbreviation is only one character long). "Denoising" refers to removing additionally spliced data, English characters, and the like, such as the symbol "-".
For example, the initial abbreviation of the company full-name text "LKS pizza catering management limited, Guangzhou City" is "LKS pizza y", wherein "y" is an additionally spliced English character. The data preprocessing module filters "LKS pizza y" and removes the "y" to obtain "LKS pizza". For another example, the initial abbreviation of the company full-name text "orange vehicle (shanghai) network technology limited" is "orange"; this initial abbreviation is only one character long (in the original Chinese) and does not meet the preset condition, so the data preprocessing module deletes it from the third sample set.
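A hedged sketch of such a filtering pass, with the deduplication, cleaning, and denoising rules implemented as illustrative assumptions rather than the application's exact rules:

import re

def filter_abbreviations(samples, min_len=2):
    # samples: list of (full_name, initial_abbreviation) pairs.
    seen, kept = set(), []
    for full_name, abbr in samples:
        abbr = abbr.replace("-", "")              # denoise: spliced symbols such as "-"
        abbr = re.sub(r"\s+[A-Za-z]$", "", abbr)  # denoise: a lone spliced English character
        if len(abbr) < min_len:                   # clean: e.g. a one-character abbreviation
            continue
        if (full_name, abbr) in seen:             # deduplicate
            continue
        seen.add((full_name, abbr))
        kept.append((full_name, abbr))
    return kept

print(filter_abbreviations(
    [("LKS pizza catering management limited, Guangzhou City", "LKS pizza y")]))
# [('LKS pizza catering management limited, Guangzhou City', 'LKS pizza')]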
Step S203, the full-name parsing model parses the type of each character in the first text, and the parsed type of each character is input into the abbreviation extraction model.
The type of each word in the first text is one of the types corresponding to the aforementioned companies. The result of parsing the first text by the full-name parsing model is shown in table 2, for example.
TABLE 2
[Table 2 is rendered as an image (Figure BDA0003016058710000211) in the original publication; it lists company full-name texts together with the parsed type of each word, as illustrated by the examples below.]
As shown in Table 2, the full-name parsing model parses each company full name in Table 2 to obtain the type of each character in that full name. For example, parsing the company full name "Guangdong province branch of China JS Bank Limited company" gives the type of each character, and characters of the same type are then combined into words: the type of "China" is primary place name, "JS Bank" is a company identification, "Limited company" is company general information, and "Guangdong province branch" is branch information. For another example, parsing the company full name "Shenzhen City TX computer systems Limited" gives: the type of "Shenzhen City" is secondary place name, the type of "TX" is company identification, the type of "computer systems" is company industry, and the type of "Limited" is company general information. The relationships between the information in the other rows of Table 2 are the same and are not repeated here.
For the sake of understanding by those skilled in the relevant art, the implementation process of the full-name parsing model is exemplarily described below, taking the Bert model, the LSTM model, and the CRF model as examples and parsing the type of each word in the company name text "Shenzhen City TX computer systems Limited".
Referring to FIG. 3, after the text "Shenzhen City TX computer systems Limited" is obtained, the full-name parsing model inputs it into the Bert model. The Bert model performs word segmentation and label embedding (token embedding, segment embedding, and position embedding) on the text. Further, the Bert model obtains the bidirectional semantic representation of each word in "Shenzhen City TX computer systems Limited" through the bidirectional transformer. Thereafter, the Bert model inputs the semantic representation of each word into the LSTM model. After obtaining the semantic representations, the LSTM model calculates the probability distribution of each word over the 6 types based on the position relationship of each word in "Shenzhen City TX computer systems Limited", and then inputs the probability distributions into the CRF model.
It should be noted that the LSTM model obtains the initial probability of each word for each of the 6 types by performing NER on each word. Taking "Shen" as an example, the LSTM model obtains, for example, an initial probability of 0.9 for B-LOC (the first character of a primary place name), 0.85 for B-CIT (the first character of a secondary place name), 0.5 for B-COM (the first character of a company identification), 0.3 for B-PRO (the first character of a company industry), 0.1 for B-MON (the first character of company general information), and 0.2 for B-PAR (the first character of branch information).
It should be understood that in practical implementation, the LSTM model also obtains initial probabilities for the other labels, such as the initial probability that "Shen" corresponds to I-LOC (a middle character of a primary place name) or to I-PRO (a middle character of a company industry). In addition, the LSTM model also obtains the initial probabilities of the other words in "Shenzhen City TX computer systems Limited" for the various types described above, which are not listed here.
In some embodiments, for each word, the LSTM model may take the type with the maximum initial probability as the type of that word, e.g., determine the type of "Shen" as B-LOC (the first character of a primary place name). However, the type determined this way may be inaccurate. Based on this, the CRF model can learn, through a learning algorithm over corpora, the type transition probability of each type combination, and can then determine the combination probability of the characters under each candidate combination of types based on these transition probabilities. The CRF model then obtains the final type of each word from the maximum combination probability and outputs it.
For example, with "+" as the combination connector, one candidate type combination considered by the CRF model is, character by character (romanized): I-CIT (Shen) + I-PRO (Zhen) + B-COM (City) + B-PRO (T) + I-MON (X) + B-CIT (Ji) + E-COM (Suan) + E-MON (Ji) + I-MON (Xi) + I-PRO (Tong) + B-PRO (You) + I-PRO (Xian) + I-MON (Gong) + E-PRO (Si). The CRF model then calculates, for this combination, the sum of the initial probabilities of each word for its assigned type (i.e., the first probability), including, for example, the probability that "Shen" corresponds to I-CIT and that "Zhen" corresponds to I-PRO. The CRF model also reads from the type transition probabilities the transition probability between every two adjacent types in the combination and calculates their sum (i.e., the second probability); these include the transition probability of I-CIT (Shen) followed by I-PRO (Zhen), the transition probability of I-PRO (Zhen) followed by B-COM (City), and so on. The CRF model likewise obtains the combination probabilities of the other type combinations, and then determines the final type of each word from the type combination with the maximum combination probability, for example: Shen (B-CIT), Zhen (I-CIT), City (E-CIT), T (B-COM), X (E-COM), Ji (B-PRO), Suan (I-PRO), Ji (I-PRO), Xi (I-PRO), Tong (E-PRO), You (B-MON), Xian (I-MON), Gong (I-MON) and Si (E-MON). That is, the final type of "Shen" is B-CIT, the final type of "Zhen" is I-CIT, and the final types of the other words follow similarly; details are not repeated here.
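The combination probability just described (first probability plus second probability) can be sketched as follows; the tag names, scores, and exhaustive enumeration are toy assumptions, and a real CRF layer would use the Viterbi algorithm instead of enumerating all combinations.

from itertools import product

def combination_score(tags, emissions, transitions):
    # First probability: sum of each word's initial probability for its tag.
    first = sum(emissions[i][t] for i, t in enumerate(tags))
    # Second probability: sum of transition probabilities between adjacent tags.
    second = sum(transitions.get((a, b), 0.0) for a, b in zip(tags, tags[1:]))
    return first + second

# Toy two-word example with assumed scores.
emissions = [{"B-CIT": 0.9, "B-LOC": 0.8}, {"I-CIT": 0.7, "I-LOC": 0.3}]
transitions = {("B-CIT", "I-CIT"): 0.6, ("B-LOC", "I-LOC"): 0.4,
               ("B-CIT", "I-LOC"): -1.0, ("B-LOC", "I-CIT"): -1.0}

best = max(product(["B-CIT", "B-LOC"], ["I-CIT", "I-LOC"]),
           key=lambda tags: combination_score(tags, emissions, transitions))
print(best)  # ('B-CIT', 'I-CIT'): the combination with the maximum combination probability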
Further, based on the principle of NER, the characters can be combined in order according to the BIE label of each character to obtain the target type words. As shown in connection with FIG. 4A, combining "Shen", "Zhen", and "City" according to the BIE identifiers yields the word "Shenzhen City", whose type tag is (CIT), i.e., the type of "Shenzhen City" is secondary place name. Combining "T" and "X" according to the BIE identifiers yields the word "TX", whose type tag is (COM), i.e., the type of "TX" is company identification. Combining the characters of "computer systems" according to the BIE identifiers yields the word "computer systems", whose type tag is (PRO), i.e., the type of "computer systems" is company industry. Combining the characters of "limited company" according to the BIE identifiers yields the word "limited company", whose type tag is (MON), i.e., the type of "limited company" is company general information.
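A minimal sketch of this B/I/E merging step, using assumed romanized characters:

def merge_bie(chars, tags):
    # Combine labelled characters into typed words following the B/I/E scheme.
    words, buf, cur = [], [], None
    for ch, tag in zip(chars, tags):
        pos, typ = tag.split("-", 1)
        if pos == "B":                 # a word of type `typ` begins
            buf, cur = [ch], typ
        elif pos in ("I", "E") and typ == cur:
            buf.append(ch)
            if pos == "E":             # the word is complete
                words.append(("".join(buf), cur))
                buf, cur = [], None
    return words

print(merge_bie(["Shen", "Zhen", "City", "T", "X"],
                ["B-CIT", "I-CIT", "E-CIT", "B-COM", "E-COM"]))
# [('ShenZhenCity', 'CIT'), ('TX', 'COM')]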
It should be noted that the combination coefficients between the aforementioned types can be obtained by training on a large vocabulary corpus using Hidden Markov Models (HMMs).
It is understood that the above describes the parsing flow of the full-name parsing model, taking the parsing of "Shenzhen City TX computer systems Limited" as an example. The process by which the full-name parsing model of the embodiment of the present application parses the types of words in the full-name texts of the other companies in Table 2 may refer to the above description and is not repeated here.
Furthermore, FIG. 3 is only a schematic illustration and does not constitute a limitation on the full-name parsing model described herein. In other embodiments, the full-name parsing model according to the embodiments of the present application can also be implemented as a combination of a BiLSTM model and a CRF model, or a combination of a Bert model and a CRF model. The embodiments of the present application do not limit this.
Therefore, with this implementation, the types of the words in the text are parsed based on a learning algorithm, which is not limited by the composition structure of the text. The learning algorithm can automatically learn semantic knowledge and iterate the parsing model, so the abbreviation extraction performance can be optimized without manual maintenance, and the maintenance cost can be lower.
Step S204, the abbreviation extraction model obtains at least one abbreviation of each company in the first text according to at least one preset rule.
In this example, the three types secondary place name, company identification, and company industry are configured as target types; the primary place name, company general information, and branch information are configured as non-target types; and an abbreviation may be required to contain at least the word whose type is the company identification. As shown in fig. 4B, the rules by which the abbreviation extraction model obtains an abbreviation from the target types may include: rule one, company identification; rule two, secondary place name + company identification; rule three, company identification + company industry; rule four, secondary place name + company identification + company industry.
Further, the abbreviation extraction model may obtain the abbreviation of each company involved in the first text according to the above four rules, respectively, to obtain the correspondence between the full name and the abbreviation of each company in the first text.
For example, still taking "Shenzhen City TX computer systems Limited" as an example, the secondary place name is "Shenzhen City", the company identification is "TX", and the company industry is "computer systems". The abbreviations "TX", "Shenzhen City TX", "TX computer systems", and "Shenzhen City TX computer systems" can be obtained according to the above four rules.
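The four rules can be sketched as follows; the dictionary layout and the constraint that every abbreviation contains the company identification are assumptions drawn from the description above. The example output concatenates without separators because Chinese text needs none.

def build_abbreviations(typed_words):
    # typed_words maps a type tag to the word of that type:
    # CIT = secondary place name, COM = company identification, PRO = company industry.
    cit = typed_words.get("CIT", "")
    com = typed_words.get("COM", "")
    pro = typed_words.get("PRO", "")
    rules = [
        com,              # rule one: company identification
        cit + com,        # rule two: secondary place name + company identification
        com + pro,        # rule three: company identification + company industry
        cit + com + pro,  # rule four: secondary place name + identification + industry
    ]
    # Every abbreviation must at least contain the company identification.
    return [r for r in rules if com and r]

print(build_abbreviations(
    {"CIT": "Shenzhen City", "COM": "TX", "PRO": "computer systems"}))
# ['TX', 'Shenzhen CityTX', 'TXcomputer systems', 'Shenzhen CityTXcomputer systems']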
It is to be understood that fig. 2 and 3 are only schematic illustrations and are not limiting on the embodiments of the present application. In other embodiments, the learning algorithm according to the embodiments of the present application may also be other algorithms, for example, any combination, chosen as needed, of the ALBERT model, a neural network model based on attention or self-attention, and an RNN model. The embodiments of the present application do not limit this.
In addition, although the foregoing embodiments all refer to company full-name texts and company abbreviations, the embodiments of the present application are not limited thereto; the technical solutions of the embodiments of the present application can also be applied to scenarios such as extracting an organization's abbreviation from the organization's full name, or extracting a file's abbreviation from the file's full name. These are not described in detail here.
To sum up, in an implementation manner of the embodiment of the application, the electronic device uses a learning algorithm to parse each word in the full-name text according to preset candidate types, and then obtains at least one abbreviation of the target subject based on the words in the full-name text whose types are target types. Because the abbreviation is obtained from the types of the words, this not only ensures the accuracy of the extracted abbreviation, but also adapts to full-name texts of different composition structures. In addition, the types of the words in the text are parsed based on a learning algorithm, which is not limited by the composition structure of the text; the learning algorithm can automatically learn semantic knowledge and iterate the parsing model, so the extraction performance can be optimized without maintenance, and the maintenance cost can be lowered.
The foregoing embodiments have described the abbreviation extraction method provided in the embodiments of the present application in terms of the actions performed by the electronic device, such as parsing the types of words, combining words of target types, and learning-algorithm processing. It should be understood that the embodiments of the present application may implement the above-described functions in hardware or in a combination of hardware and computer software, in the form of processing steps corresponding to the parsing of word types, the combination of words of target types, and so on. Whether a function is performed as hardware or as computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
For example, the above implementation steps may implement the corresponding functions through software modules. As shown in fig. 5A, the abbreviation extracting apparatus 50 may include an obtaining module 501, a parsing module 502, and an abbreviation extracting module 503. The abbreviation extracting apparatus 50 may be used to perform some or all of the operations of any of the methods 10 to 30 described above.
For example: an obtaining module 501, configured to obtain a first text. A parsing module 502 for parsing a type of each word in the first text, wherein the type of each word is one of at least one target type and at least one non-target type, and the target type includes a type related to an abbreviation of a target subject. The abbreviation extracting module 503 is configured to obtain an abbreviation of the target subject according to the word with the type of the target type in the first text.
Therefore, the abbreviation extracting apparatus 50 provided in this embodiment of the present application may parse the type of each word in the first text after obtaining the full-name text (i.e., the first text) of the target subject, and then obtain the abbreviation of the target subject according to the words whose types are target types. Because the type of each word in the first text is obtained through a learning algorithm and the abbreviation of the subject is derived from those types, the accuracy of the extracted abbreviation can be ensured, no extraction rules need to be maintained, and the maintenance cost can be reduced.
Optionally, the parsing module 502 is further configured to obtain a semantic representation of each word in the first text; and obtaining the initial probability of each word corresponding to each type according to the semantic representation of each word, and obtaining the final type of the corresponding word according to the initial probability corresponding to each word.
Optionally, the parsing module 502 is further configured to use a type corresponding to the maximum initial probability corresponding to each word as a final type of the corresponding word. In another example, the parsing module 502 is further configured to determine a final type of each word according to an initial probability and a type transition probability of each word corresponding to each type, wherein the type transition probability characterizes a probability that each type is combined with any one of at least one target type and at least one non-target type.
Optionally, the parsing module 502 is further configured to determine at least one combination probability corresponding to each word in the first text according to the initial probability and the type transition probability of each word corresponding to each type, where the type combination corresponding to each combination probability is obtained by combining each word in the first text by one type, and for any combination, the combination probability is a sum of a first probability and a second probability, where the first probability is a sum of the initial probabilities of the types corresponding to each word in the corresponding type combination, and the second probability is a sum of the type transition probabilities between adjacent types in the types included in the combination. In this example, the parsing module 502 is further configured to determine a type combination corresponding to a maximum combination probability of the at least one combination probability as a target type combination, and determine a type corresponding to each word in the target type combination as a final type of each word.
Optionally, the abbreviation extracting module 503 is further configured to obtain at least one abbreviation of the target subject by combining the words whose types are target types according to at least one rule, respectively. In this embodiment, the abbreviation extracting module 503 is further configured to select at least one target rule from the at least one rule according to the target types corresponding to the words contained in the identified first text, and to combine the words whose types are target types according to the at least one target rule to obtain at least one abbreviation of the target subject.
Optionally, in another embodiment of the present application, the obtaining module 501 is further configured to obtain a second text set, where the second text set includes at least one full-name text and at least one abbreviation corresponding to the at least one full-name text. The parsing module 502 is further configured to obtain at least one rule according to a type of a word included in each abbreviation in a corresponding full-name text of the abbreviation.
Optionally, at least one rule indicates that the elements constituting the abbreviation include a target type characterizing a target subject identity.
Optionally, if the target subject is an organization, at least one target type includes a secondary place name, an organization identifier, and organization industry information, and at least one non-target type includes a primary place name, organization type information, and organization branch information.
Optionally, the target subject includes a word whose type is an organization identifier and a word corresponding to at least one of the following target types: and the word corresponding to the secondary place name type or the word corresponding to the organization type information.
It is noted that the parsing module 502 can implement the "parsing model" and the "full-name parsing model" mentioned in the foregoing embodiments. Accordingly, the functionality of the parsing module 502 can be implemented based on a learning algorithm that combines a Bert model, an LSTM model, and a CRF model; or a BiLSTM model and a CRF model; or a Bert model and a CRF model.
It is understood that the above division into modules/units is only a division of logical functions. In actual implementation, the functions of the above modules may be integrated into hardware entities; for example, the functions of the parsing module 502 and the abbreviation extracting module 503 may be integrated into a processor, the function of the obtaining module 501 may be integrated into a transceiver, and the programs and instructions implementing the functions of the above modules may be maintained in a memory. For example, fig. 5B provides an electronic device 51, which may include a processor 511, a transceiver 512, and a memory 513. The transceiver 512 is used to perform the text acquisition in the methods 10 to 30. The memory 513 may be used to store the pre-installed program/code of the foregoing extraction apparatus 50, or to store code for execution by the processor 511. When the processor 511 executes the code stored in the memory 513, the electronic device 51 is caused to perform some or all of the operations of the abbreviation extraction methods 10 to 30.
For example, the processor 511 may be configured to parse a type of each word in the first text, wherein the type of each word is one of at least one target type and at least one non-target type, the target type includes a type associated with an abbreviation of the target subject, and obtain the abbreviation of the target subject from the word of the type in the first text as the target type.
The specific implementation process is described in the above exemplary embodiments of the methods 10 to 30, and will not be described in detail here.
In a specific implementation, corresponding to the foregoing electronic device, an embodiment of the present application further provides a computer storage medium, where the computer storage medium disposed in the electronic device may store a program, and when the program is executed, some or all of the steps in each embodiment of the extraction methods 10 to 30 may be implemented. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
One or more of the above modules or units may be implemented in software, hardware or a combination of both. When any of the above modules or units are implemented in software, which is present as computer program instructions and stored in a memory, a processor may be used to execute the program instructions and implement the above method flows. The processor may include, but is not limited to, at least one of: various computing devices that run software, such as a Central Processing Unit (CPU), a microprocessor, a Digital Signal Processor (DSP), a Microcontroller (MCU), or an artificial intelligence processor, may each include one or more cores for executing software instructions to perform operations or processing. The processor may be built in an SoC (system on chip) or an Application Specific Integrated Circuit (ASIC), or may be a separate semiconductor chip. The processor may further include a necessary hardware accelerator such as a Field Programmable Gate Array (FPGA), a PLD (programmable logic device), or a logic circuit for implementing a dedicated logic operation, in addition to a core for executing software instructions to perform an operation or a process.
When the above modules or units are implemented in hardware, the hardware may be any one or any combination of a CPU, a microprocessor, a DSP, an MCU, an artificial intelligence processor, an ASIC, an SoC, an FPGA, a PLD, a dedicated digital circuit, a hardware accelerator, or a discrete device that is not integrated, which may run necessary software or is independent of software to perform the above method flows.
Further, a bus interface may also be included in FIG. 5B, which may include any number of interconnected buses and bridges, with one or more processors, represented by a processor, and various circuits of memory, represented by memory, linked together. The bus interface may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface. The transceiver provides a means for communicating with various other apparatus over a transmission medium. The processor is responsible for managing the bus architecture and the usual processing, and the memory may store data used by the processor in performing operations.
When the above modules or units are implemented using software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one web site, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be understood that, in the various embodiments of the present application, the size of the serial number of each process does not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic thereof, and should not constitute any limitation to the implementation process of the embodiments.
All parts of the specification are described in a progressive manner; the same and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, as to the apparatus and system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the description of the method embodiments in the relevant places.
While alternative embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
The above-mentioned embodiments, objects, technical solutions and advantages of the present application are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present application, and are not intended to limit the scope of the present application, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present application should be included in the scope of the present invention.

Claims (12)

1. An abbreviation extraction method, comprising:
acquiring a first text;
analyzing the type of each word in the first text, wherein the type of each word is one of at least one target type and at least one non-target type, and the target type comprises a type related to short names of target subjects;
and obtaining the abbreviation of the target main body according to the word with the type of the target type in the first text.
2. The method of claim 1, wherein said parsing the type of each word in the first text comprises:
obtaining semantic representation of each word in the first text;
obtaining the initial probability of each word corresponding to each type according to the semantic representation of each word;
and obtaining the final type of the corresponding word according to the initial probability corresponding to each word.
3. The method of claim 2, wherein obtaining the final type of each word according to the initial probability corresponding to each word comprises:
taking the type corresponding to the maximum initial probability corresponding to each word as the final type of the corresponding word; or,
and determining the final type of each word according to the initial probability and the type transition probability of each word corresponding to each type, wherein the type transition probability characterizes the probability that each type is combined with any one of the at least one target type and the at least one non-target type.
4. The method of claim 3, wherein said determining a final type for each word based on said initial probability and said type transition probability for each said type for each said word comprises:
determining at least one combination probability corresponding to each word in the first text according to the initial probability and the type transition probability of each word corresponding to each type, wherein the type combination corresponding to each combination probability is obtained by combining each word in the first text in one type, and for any combination, the combination probability is the sum of a first probability and a second probability, the first probability is the sum of the initial probabilities of the types corresponding to each word in the corresponding type combination, and the second probability is the sum of the type transition probabilities between adjacent types in the types contained in the combination;
and determining the type combination corresponding to the maximum combination probability in the at least one combination probability as a target type combination, and determining the type corresponding to each word in the target type combination as the final type of each word.
5. The method according to claim 1, wherein the obtaining the abbreviation of the target subject from the word of the type of the target type in the first text comprises:
respectively combining the words whose types are the target type according to at least one rule to obtain at least one abbreviation of the target subject; or,
selecting at least one target rule from at least one rule according to the identified target type corresponding to the words contained in the first text; and combining the words with the types as the target types according to the at least one target rule to obtain at least one abbreviation of the target main body.
6. The method according to claim 1 or 5, characterized in that the method further comprises:
acquiring a second text set, wherein the second text set comprises at least one full-name text and at least one abbreviation corresponding to the at least one full-name text respectively;
and obtaining the at least one rule according to the type of the word contained in each abbreviation in the full-name text corresponding to the abbreviation.
7. The method of claim 5,
the at least one rule indicates that the elements comprising the abbreviation include a target type characterizing the target subject identity.
8. The method of claim 1,
if the target subject is an organization, the at least one target type comprises a secondary place name, an organization identifier and organization industry information,
the at least one non-target type includes a primary place name, organization type information, and organization branch information.
9. The method of claim 8,
the target subject includes a word whose type is an organization identifier and a word corresponding to at least one of the following target types:
and the words corresponding to the secondary place name types or the words corresponding to the organization type information.
10. An extraction device, comprising:
the acquisition module is used for acquiring a first text;
the analysis module is used for analyzing the type of each word in the first text, wherein the type of each word is one of at least one target type and at least one non-target type, and the target type comprises a type related to the abbreviation of a target subject;
and the abbreviation extraction module is used for obtaining the abbreviation of the target main body according to the word with the type of the target type in the first text.
11. An electronic device, wherein the electronic device comprises memory and one or more processors; wherein the memory is for storing a computer program; the computer program, when executed by the processor, causes the electronic device to perform the abbreviation extraction method of any of claims 1 to 9.
12. A computer-readable storage medium, characterized in that it stores a computer program which, when run on a computer, causes the computer to perform the abbreviation extraction method of any one of claims 1 to 9.
CN202110389648.2A 2021-04-12 2021-04-12 Extraction method and electronic equipment for short Pending CN113705194A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110389648.2A CN113705194A (en) 2021-04-12 2021-04-12 Extraction method and electronic equipment for short

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110389648.2A CN113705194A (en) 2021-04-12 2021-04-12 Extraction method and electronic equipment for short

Publications (1)

Publication Number Publication Date
CN113705194A true CN113705194A (en) 2021-11-26

Family

ID=78647975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110389648.2A Pending CN113705194A (en) 2021-04-12 2021-04-12 Extraction method and electronic equipment for short

Country Status (1)

Country Link
CN (1) CN113705194A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017899A (en) * 2022-04-19 2022-09-06 北京三快在线科技有限公司 Abbreviation generation method, device, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination