CN113051900B - Synonym recognition method, synonym recognition device, computer equipment and storage medium - Google Patents

Publication number
CN113051900B
Authority
CN
China
Prior art keywords
character string
entity
character
string set
strings
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110479989.9A
Other languages
Chinese (zh)
Other versions
CN113051900A (en
Inventor
陈岳峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110479989.9A priority Critical patent/CN113051900B/en
Publication of CN113051900A publication Critical patent/CN113051900A/en
Application granted granted Critical
Publication of CN113051900B publication Critical patent/CN113051900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The application belongs to the field of big data and relates to a synonym recognition method and device. The method comprises: searching with the elements of a first character string set, obtained from a text to be subjected to synonym recognition, and of an entity set, formed from a preset knowledge base, as keywords to generate a second character string set containing recognized character strings; generating a number of synonym pairs based on the three sets to link at least one character string in the first character string set to an entity; generating a third character string set from the recognized character strings; performing supplementary recognition and labeling on the character strings in the first character string set based on the first character string set, the third character string set and the entity set; and, after updating the third character string set with the supplementarily recognized and labeled character strings, repeating the recognition and labeling of the first character string set until the recognition recall rate meets a preset condition. The application also relates to blockchain technology: private information in the text may be stored on a blockchain. Through active learning and the transitivity of synonyms, the application improves recognition accuracy.

Description

Synonym recognition method, synonym recognition device, computer equipment and storage medium
Technical Field
The application relates to the technical field of big data, in particular to a synonym recognition method, a synonym recognition device, computer equipment and a storage medium.
Background
Synonym recognition is an important and fundamental problem in NLP (Natural Language Processing) and has important applications in knowledge graphs and question answering (QA) systems; in a QA system, for example, synonym recognition greatly affects answer accuracy and coverage. Synonym recognition finds, in a knowledge base, the entity that is synonymous with an entity to be recognized. Below, the entity to be recognized is called the Mention and an entity in the knowledge base is called the Entity. The main solutions for identifying the Entity in the knowledge base that is synonymous with a Mention generally include the following:
(1) Relying on an existing manually edited knowledge base, such as the HIT (Harbin Institute of Technology) Tongyici Cilin (Extended Edition) synonym thesaurus or HowNet. This method is simple, easy to obtain and highly accurate, but its coverage is low in a specific vertical field.
(2) Relying on the contextual relatedness of texts, for example with unsupervised Word2Vec or weakly supervised DPE: the similarity between each entity to be recognized and the entities in the knowledge base is calculated, and the Top-K most similar knowledge-base entities are obtained by ranking.
(3) Relying on text similarity: whether a Mention and an Entity are synonyms is judged by directly calculating their text similarity. This method is simple to compute and needs no large-scale corpus, but it can mine many wrong synonyms.
Some existing products perform synonym recognition in vertical fields. Because a large-scale corpus is hard to obtain in a restricted field, and because in many scenarios the Mention lacks useful context in the corpus (for example, user queries in QA are generally short and give the Mention little context), solutions (1) and (2) struggle to achieve effective synonym recognition. Text-similarity calculation is limited by its screening threshold: recall is low when the threshold is set high, and accuracy is low when it is set low, so synonyms are frequently predicted wrongly. For example, in vertical fields such as medicine and occupations, the Mention and the Entity are mostly phrases composed of several words; the Mention is often colloquial, whereas the Entity in the knowledge base is written language. Text similarity computed on phrases is not accurate enough, and the difference between spoken and written wording makes the directly calculated similarity less accurate still, which degrades the final recognition accuracy.
Disclosure of Invention
The embodiment of the application aims to provide a synonym recognition method, a synonym recognition device, computer equipment and a storage medium, which are used for solving the problems that the effective synonym recognition is difficult to realize under the condition of less corpus in the prior art and the synonym recognition accuracy is low when text similarity is directly calculated in the vertical field.
In order to solve the above technical problems, the embodiment of the present application provides a synonym recognition method, which adopts the following technical scheme:
a synonym identification method comprising the steps of:
carrying out named entity recognition on a text to be subjected to synonym recognition to obtain a first character string set, and reading a plurality of entities in a preset knowledge base to form an entity set;
sequentially searching in at least one given data search engine by taking the elements in the first character string set and the entity set as keywords, and generating a second character string set according to a search result, wherein the second character string set comprises recognized character strings;
generating a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and linking at least one character string in the first character string set to an entity in the entity set according to the synonym pairs to complete synonym identification of at least one character string;
generating a third character string set from the recognized character strings in the first character string set and the second character string set, performing supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set, extracting character strings to be labeled from the first character string set after the supplementary recognition, and, after labeling, linking the extracted character strings to entities in the entity set based on the labeling result;
and updating the third character string set with the supplementarily recognized and labeled character strings, and repeating the supplementary recognition and labeling of the character strings still unrecognized in the first character string set after labeling, until the recognition recall rate of the character strings in the first character string set meets a preset condition.
In order to solve the technical problems, the embodiment of the application also provides a synonym recognition device, which adopts the following technical scheme:
a synonym identification device comprising:
the data acquisition module is used for carrying out named entity recognition on texts to be subjected to synonym recognition to obtain a first character string set, and reading a plurality of entities in a preset knowledge base to form an entity set;
the search module is used for sequentially searching in at least one given data search engine by taking the elements in the first character string set and the entity set as keywords, and generating a second character string set according to a search result, wherein the second character string set comprises the identified character strings;
the first recognition module is used for generating a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and linking at least one character string in the first character string set to an entity in the entity set according to the synonym pairs to complete synonym recognition of at least one character string;
the labeling module is used for generating a third character string set according to the character strings which are recognized in the first character string set and the second character string set, carrying out supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set, extracting the character strings to be labeled of the first character string set after the supplementary recognition, and linking the extracted character strings to the entities in the entity set based on a labeling result after labeling;
and the control module is used for updating the third character string set according to the character strings which are recognized and marked in a supplementing way, and then enabling the marking module to repeatedly execute the supplementing recognition and marking on the character strings which are not recognized in the first character string set after the marking until the recognition recall rate of the character strings in the first character string set meets the preset condition.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the synonym identification method as described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the synonym identification method as described above.
Compared with the prior art, the synonym identification method, the synonym identification device, the computer equipment and the storage medium provided by the embodiment of the application have the following main beneficial effects:
the recognition accuracy is improved through the transitivity of synonyms, the influence of the difference between spoken and written language is reduced, and the character strings to be labeled are selected by active learning, so that the recognition workload is greatly reduced.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, the drawings in the following description corresponding to some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a synonym identification method according to the present disclosure;
FIG. 3 is a schematic diagram of one embodiment of a synonym recognition device according to the present disclosure;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Description of the embodiments
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the synonym recognition method provided by the embodiment of the present application is generally executed by a server, and accordingly, the synonym recognition device is generally disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of a synonym identification method according to the present disclosure is shown. The synonym recognition method comprises the following steps:
S201, carrying out named entity recognition on texts to be subjected to synonym recognition to obtain a first character string set, and reading a plurality of entities in a preset knowledge base to form an entity set;
S202, sequentially searching in at least one given data search engine by taking the elements in the first character string set and the entity set as keywords, and generating a second character string set according to a search result, wherein the second character string set comprises the identified character strings;
S203, generating a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and linking at least one character string in the first character string set to an entity in the entity set according to the synonym pairs to complete synonym identification of at least one character string;
S204, generating a third character string set according to the character strings which are recognized in the first character string set and the second character string set, carrying out supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set, extracting character strings to be labeled from the first character string set after the supplementary recognition, and, after labeling, linking the extracted character strings to the entities in the entity set based on the labeling result;
S205, updating the third character string set according to the character strings which are recognized and marked in a supplementing mode, and repeating the steps of recognizing and marking the character strings which are not recognized in the first character string set after marking until the recognition recall rate of the character strings in the first character string set meets the preset condition.
The above steps are explained below.
The embodiment of the application is applied to a vertical field with a small corpus. A vertical field focuses on only a certain part of an industry, for example the disease field within the medical industry.
For step S201, in this embodiment the text to be subjected to synonym recognition is a text containing several character strings that require synonym recognition. For example, after a QA system receives a text submitted by a user (such as a question or an operation instruction), it needs to recognize the intent of the text in order to generate a targeted reply and feed it back to the user. The text submitted by the user is then the text to be subjected to synonym recognition, and the character strings requiring synonym recognition can be extracted from it by named entity recognition to obtain the first character string set.
A knowledge base is a structured, easy-to-operate, easy-to-use, comprehensive and organized knowledge cluster in knowledge engineering: a set of interconnected knowledge pieces that are stored, organized, managed and used in computer memory, using one or several knowledge representation modes, to meet the needs of solving problems in one or several fields. The knowledge in a knowledge base is generally represented as entities, that is, the knowledge base contains a plurality of entities. In a vertical field the entity coverage of the knowledge base is small, and when a plurality of entities in the preset knowledge base are read to form the entity set, either all the entities or only some of them may be read.
In some embodiments, when only some of the entities are read, the step of reading a plurality of entities in the preset knowledge base to form an entity set includes: judging whether the character strings in the first character string set can be directly recognized in the preset knowledge base; if so, linking the directly recognized character strings to the corresponding entities in the preset knowledge base, acquiring from the preset knowledge base the entities directly associated with the linked entities, and generating the entity set from the linked entities and the directly associated entities. Restricting the entity set to the linked entities and their directly associated entities can improve recognition accuracy, and it also reduces the workload of subsequent retrieval and similarity calculation and improves the efficiency of synonym recognition.
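The partial-read strategy can be illustrated with a minimal sketch. It assumes the knowledge base is held as a plain dictionary that maps each entity name to the names of its directly associated entities, and that a string counts as "directly recognized" when it equals an entity name; `KNOWLEDGE_BASE`, `build_entity_set` and the sample data are hypothetical and not taken from the application.

```python
# Hypothetical toy knowledge base: entity -> directly associated entities.
KNOWLEDGE_BASE = {
    "hypertension": {"blood pressure", "cardiology"},
    "blood pressure": {"hypertension"},
    "cardiology": {"hypertension", "arrhythmia"},
    "arrhythmia": {"cardiology"},
    "diabetes": {"insulin"},
    "insulin": {"diabetes"},
}


def build_entity_set(first_strings, knowledge_base):
    """Return (links, entity_set) for strings found verbatim in the knowledge base."""
    # Assumption: "directly recognized" means an exact match with an entity name.
    links = {s: s for s in first_strings if s in knowledge_base}
    entity_set = set(links.values())
    for entity in links.values():
        entity_set |= knowledge_base[entity]  # add directly associated entities
    return links, entity_set


links, entity_set = build_entity_set(["hypertension", "high BP"], KNOWLEDGE_BASE)
```

Here only "hypertension" is linked directly, so the entity set is limited to it and its two neighbours, and the unrelated "diabetes" cluster is never read.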
In a further embodiment, the step of reading a plurality of entities in the preset knowledge base to form an entity set further comprises: obtaining a preset association level; according to the association level, obtaining from the preset knowledge base the other entities, apart from the linked entity, that are associated with the directly associated entities; taking these other entities as indirectly associated entities of the linked entity; and adding the indirectly associated entities to the entity set. This step improves the effective coverage of the entity set, so that the entity set contains the synonyms of the character strings in the first character string set as far as possible. A directly associated entity may itself be associated with entities other than the linked entity, and further associated entities can in turn be reached through those indirectly associated entities. By setting the association level, entities with multi-level indirect association can thus be obtained starting from the directly associated entities and added to the entity set; for example, the association level may be set to two levels.
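One way to realize a bounded association level is a breadth-first traversal that stops at a fixed depth. The sketch below assumes the same dictionary representation of the knowledge base and counts direct association as level 1; the function name `expand_by_level` and the toy graph are illustrative only.

```python
from collections import deque


def expand_by_level(seeds, knowledge_base, max_level):
    """Collect entities reachable from the seeds within max_level association hops."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        entity, level = queue.popleft()
        if level == max_level:
            continue  # do not expand beyond the preset association level
        for neighbor in knowledge_base.get(entity, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, level + 1))
    return seen


# Hypothetical toy graph: a - b - c - d
kb = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
one_level = expand_by_level({"a"}, kb, max_level=1)   # linked entity plus direct neighbours
two_levels = expand_by_level({"a"}, kb, max_level=2)  # also one layer of indirect neighbours
```

Raising `max_level` trades a larger entity set, and therefore more retrieval and similarity work, for a better chance that the set contains the wanted synonym.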
For step S202, in this embodiment a search is performed for each keyword, character strings are extracted from the search results of each keyword, and the extracted character strings are then merged to generate the second character string set. When extracting character strings, only the first several items of the search results may be used. For example, the character strings in the first character string set and the entities in the entity set are used as keywords to search a commercial search engine, and character strings are extracted from the Top-K web pages returned.
In this embodiment, the elements of the second character string set may include character strings from the first character string set and entities from the entity set, and may also include character strings that belong to neither set. Such character strings are called unregistered character strings. Some unregistered character strings have synonyms in the entity set; that is, the second character string set includes recognized character strings.
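The Top-K merging logic can be sketched independently of any particular search engine. In the example below the search function and the string extractor are passed in as parameters; `fake_search`, `fake_extract`, `FAKE_INDEX` and `VOCABULARY` are stand-ins invented for illustration and do not correspond to any real search API.

```python
def build_second_string_set(keywords, search, extract_strings, top_k=3):
    """Merge the strings extracted from the top_k results of every keyword search."""
    second_set = set()
    for keyword in keywords:
        for page_text in search(keyword)[:top_k]:  # keep only the leading result items
            second_set.update(extract_strings(page_text))
    return second_set


# Stand-ins for a real search engine and a real string extractor (both hypothetical).
FAKE_INDEX = {
    "hypertension": ["hypertension, also called high blood pressure", "unrelated page"],
    "high BP": ["high BP is short for high blood pressure"],
}
VOCABULARY = ["hypertension", "high blood pressure", "high BP"]


def fake_search(keyword):
    return FAKE_INDEX.get(keyword, [])


def fake_extract(page_text):
    return {term for term in VOCABULARY if term in page_text}


second_set = build_second_string_set(
    ["hypertension", "high BP"], fake_search, fake_extract, top_k=1
)
```

In this toy run "high blood pressure" appears in neither input set, so it plays the role of an unregistered character string in the second character string set.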
For step S203, this embodiment performs initial synonym pair mining on the first character string set, the entity set and the second character string set. In this scheme, synonym pairs are mined from the search results by pattern matching (Pattern-Based): for patterns such as "A, also called B" or "A, alias B", A and B form a synonym pair.
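Pattern-based mining of this kind can be sketched with regular expressions. The two English patterns below mirror the "A, also called B" and "A, alias B" examples; they are illustrative assumptions, and a real system would use patterns written for the language of its corpus.

```python
import re

# Illustrative English patterns; not the patterns used by the application.
PATTERNS = [
    re.compile(r"(?P<a>[\w ]+?), also called (?P<b>[\w ]+)"),
    re.compile(r"(?P<a>[\w ]+?), alias (?P<b>[\w ]+)"),
]


def mine_synonym_pairs(texts):
    """Return the set of (A, B) pairs matched by any pattern in any text."""
    pairs = set()
    for text in texts:
        for pattern in PATTERNS:
            for match in pattern.finditer(text):
                pairs.add((match.group("a").strip(), match.group("b").strip()))
    return pairs


pairs = mine_synonym_pairs([
    "hypertension, also called high blood pressure.",
    "myocardial infarction, alias heart attack.",
])
```

Each matched pair can then be checked against the three sets to decide whether it links a character string of the first set to an entity.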
A synonym pair may consist of two elements from the same set among the first character string set, the entity set and the second character string set, or of two elements from different sets. That is, a synonym pair generated in this embodiment may consist of a character string in the first character string set and an entity in the entity set, of two character strings in the first character string set, or of a character string in the first character string set and an unregistered character string. Based on these synonym pairs, some character strings in the first character string set can be linked directly to entities in the entity set, which completes their recognition, and others can be linked to entities in the entity set through a one-step jump via a character string in the second character string set. For example, if a character string C1 in the first character string set forms a synonym pair with a character string C2 in the second character string set, and C2 forms a synonym pair with an entity S in the entity set, then C1 can be linked to S, which completes the recognition of C1.
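Linking through intermediate strings relies on the transitivity of synonymy, which can be sketched as a search over a graph whose edges are the mined synonym pairs. The sketch below follows chains of any length, which generalizes the one-step jump described above; `link_by_transitivity` is a hypothetical name, and when several entities are reachable the one returned is arbitrary.

```python
from collections import defaultdict


def link_by_transitivity(first_strings, entities, synonym_pairs):
    """Link each first-set string to an entity reachable through synonym pairs."""
    graph = defaultdict(set)
    for a, b in synonym_pairs:  # synonymy is treated as symmetric
        graph[a].add(b)
        graph[b].add(a)

    links = {}
    for start in first_strings:
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            if node in entities:
                links[start] = node  # first entity reached; later ones are ignored
                break
            for neighbor in graph[node] - seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return links


links = link_by_transitivity(
    first_strings=["C1", "C3"],
    entities={"S"},
    synonym_pairs=[("C1", "C2"), ("C2", "S")],
)
```

"C1" reaches entity "S" through "C2", while "C3" has no synonym pair and stays unrecognized for the later supplementary recognition.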
For step S204, the elements in the third string set in this embodiment are all strings for which synonym recognition has been completed, that is, the elements in the third string set can be matched with corresponding synonyms in the entity set.
In some embodiments, the step of performing supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set includes: calculating the similarity between each unrecognized character string in the first character string set and the elements of the third character string set and the entity set, and, when a similarity is greater than a first preset threshold, linking the corresponding character string in the first character string set to an entity in the entity set to complete the supplementary recognition. Specifically, for each unrecognized character string, its similarity to all entities and recognized character strings can be calculated with measures such as Jaccard distance or edit distance. If its similarity to an entity or a recognized character string exceeds the first preset threshold, that entity or recognized character string is considered a synonym of the unrecognized character string, which can then be linked, directly or indirectly, to an entity in the entity set. To ensure accuracy, the first preset threshold is set to a relatively large value.
When the similarity is greater than the first preset threshold, the step of linking the corresponding character string in the first character string set to the entity in the entity set includes: judging whether several similarities are greater than the first preset threshold, and if so, selecting the maximum similarity and linking the corresponding character string in the first character string set to the entity in the entity set based on the maximum similarity. Selecting the highest of several similarities that exceed the first preset threshold yields higher recognition accuracy.
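The thresholded best-match step can be sketched as follows. The code assumes character-level Jaccard similarity (one minus the Jaccard distance) and a similarity derived from normalized Levenshtein edit distance; the function names and the threshold value 0.8 are illustrative choices, not values given in the application.

```python
def jaccard(a, b):
    """Jaccard similarity over the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def edit_similarity(a, b):
    """1 - normalized Levenshtein distance, computed with a rolling DP row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    longest = max(len(a), len(b))
    return 1.0 - previous[-1] / longest if longest else 1.0


def best_match(string, candidates, threshold, similarity=edit_similarity):
    """Return the candidate with the highest similarity above threshold, else None."""
    scored = [(similarity(string, c), c) for c in candidates]
    above = [item for item in scored if item[0] > threshold]
    return max(above)[1] if above else None


match = best_match("hypertensoin", ["hypertension", "diabetes"], threshold=0.8)
no_match = best_match("zzz", ["hypertension", "diabetes"], threshold=0.8)
```

A string for which `best_match` returns `None` stays unrecognized and becomes a candidate for the labeling step.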
In some embodiments, the step of extracting character strings to be labeled from the first character string set after the supplementary recognition, and linking the extracted character strings to entities in the entity set based on the labeling result after labeling, includes: selecting any unrecognized character string in the first character string set after the supplementary recognition as the current character string; calculating the similarity between the current character string and the other unrecognized character strings in the first character string set and the elements of the entity set; if the number of similarities exceeding a second preset threshold reaches a preset value, outputting the current character string to a target receiving end for labeling; and linking the current character string to an entity in the entity set based on the labeling result. The second preset threshold is less than the first preset threshold. After the supplementary recognition, the character strings that remain unrecognized in the first character string set are harder to recognize. Their similarity to the other unrecognized character strings and to the elements of the entity set is therefore calculated, and the similarities that exceed the second preset threshold are collected; each such similarity points to one character string or entity to which the current character string could be linked. If the number of such candidates reaches the preset value, the current character string is difficult to recognize and needs assisted labeling for synonym recognition. Screening the character strings to be labeled in this active learning process quickly yields the character strings that should be labeled first and improves the efficiency of labeling and recognition.
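The selection rule for labeling can be sketched as a count of near neighbours. The example uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity (the text mentions Jaccard distance and edit distance); `select_for_annotation`, the threshold 0.5 and the count 2 are assumptions chosen for the toy data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; any other string similarity could be used."""
    return SequenceMatcher(None, a, b).ratio()


def select_for_annotation(unrecognized, entities, second_threshold, min_count):
    """Return the unrecognized strings that have at least min_count near neighbours."""
    selected = []
    for current in unrecognized:
        # Compare with the other unrecognized strings and with all entities.
        others = [s for s in unrecognized if s != current] + list(entities)
        hits = sum(similarity(current, other) > second_threshold for other in others)
        if hits >= min_count:
            selected.append(current)  # ambiguous enough to need a human label
    return selected


to_label = select_for_annotation(
    unrecognized=["abcd", "abce", "zzzz"],
    entities=["abcf"],
    second_threshold=0.5,
    min_count=2,
)
```

In this toy run "abcd" and "abce" each have two plausible link targets and are sent for labeling, while "zzzz" resembles nothing and is left alone.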
In some embodiments, before the step of calculating the similarity between the unrecognized character strings in the first character string set and the elements of the third character string set and the entity set, the method further comprises: generating a plurality of character substrings from all the character strings in the first character string set and the second character string set; obtaining, based on the character substrings, the frequency of the common character substrings shared among the character strings; labeling weights for the common character substrings whose frequency reaches a preset value; and regenerating all the character strings in the first character string set and the second character string set according to the labeled weights to obtain a new first character string set and a new second character string set. The similarity is then calculated on the new first character string set and the new second character string set; labeling some of the high-frequency common character substrings increases the accuracy of the similarity calculation. The labeled weight may be 0 or 1. For example, in synonym recognition for occupations, "work" is a high-frequency common character substring that has little influence on whether a character string in the first character string set and an entity in the entity set are synonyms, so its weight may be labeled 0. "UI design work" is then regenerated as "UI design" and "programmer work" as "programmer", which is equivalent to treating "UI design work" and "UI design", and "programmer work" and "programmer", as synonyms.
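The weighting step can be sketched in two parts: finding high-frequency common substrings, then regenerating the strings from the labeled weights. For readability the sketch treats whitespace-separated tokens as the substrings, which is an assumption made for the English example; the labeled weights are supplied by hand here, and `frequent_tokens` and `regenerate` are hypothetical names.

```python
from collections import Counter


def frequent_tokens(strings, min_frequency):
    """Tokens shared by at least min_frequency strings (each string counted once)."""
    counts = Counter(token for s in strings for token in set(s.split()))
    return {token for token, n in counts.items() if n >= min_frequency}


def regenerate(strings, weights):
    """Drop tokens whose labeled weight is 0; unlabeled tokens keep weight 1."""
    return [" ".join(t for t in s.split() if weights.get(t, 1) != 0) for s in strings]


strings = ["UI design work", "programmer work", "UI design", "programmer"]
candidates = frequent_tokens(strings, min_frequency=2)
# Hypothetical labeling result: only "work" is judged uninformative.
weights = {token: (0 if token == "work" else 1) for token in candidates}
new_strings = regenerate(strings, weights)
```

Frequency alone flags every token in this small sample, which is why the weights are labeled rather than derived automatically: only "work" is set to 0.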
For step S205, the character strings recognized in the previous step are added to the third character string set of recognized character strings, the supplementary recognition is repeated, and more character strings that are difficult to recognize are obtained for annotation. These steps are repeated until the recall rate reaches the required value. For example, if 90% of the character strings in the first character string set can be matched to a corresponding entity in the entity set as a synonym, the recall rate is 90%; if the preset recall rate is 90%, recognition stops when this value is reached and the recognition result is output.
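The loop in step S205 can be sketched as below. The callables `supplement` and `annotate` are hypothetical placeholders for the supplementary recognition and annotation steps; each is assumed to return a mapping from newly recognized strings to entities. The `max_rounds` guard and the early exit when no progress is made are additions for safety and are not taken from the source.

```python
def iterate_until_recall(first_set, third_set, target_recall,
                         supplement, annotate, max_rounds=100):
    """Repeat supplementary recognition and annotation until recall is reached.

    `third_set` maps already recognized strings to entities.
    Returns the updated mapping and the final recall over `first_set`.
    """
    third = dict(third_set)  # work on a copy

    def recall():
        # Fraction of first-set strings that have been linked to an entity.
        return sum(1 for s in first_set if s in third) / len(first_set)

    for _ in range(max_rounds):
        if recall() >= target_recall:
            break
        unrecognized = [s for s in first_set if s not in third]
        new_links = dict(supplement(unrecognized, third))
        remaining = [s for s in unrecognized if s not in new_links]
        new_links.update(annotate(remaining, third))
        if not new_links:
            break  # no progress in this round (assumed stopping rule)
        third.update(new_links)
    return third, recall()
```

With a target recall of 0.9, the loop stops once 90% of the strings in the first set have been linked, matching the example in the text.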
In some embodiments, after the recognition recall rate of the character strings in the first character string set reaches the preset value, the method further includes: updating the preset knowledge base according to the recognition results of the character strings in the first character string set and the second character string set, so that the recognized unregistered character strings are added to the preset knowledge base as new entities. The updated knowledge base can improve the recall rate and accuracy of subsequent synonym recognition.
According to the synonym recognition method provided by the application, synonym recognition is performed through active learning: common character substrings with a large influence are screened out and a small amount of annotation is performed on them to improve the accuracy of the text similarity calculation, and the transitivity of synonymy is exploited. This greatly reduces the annotation workload while maintaining high accuracy, and reduces the influence of differences between spoken and written text on synonym recognition.
It should be emphasized that, to further ensure the privacy and security of the information, the private information in the text to be subjected to synonym recognition may also be stored in a node of a blockchain.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by computer readable instructions stored in a computer readable storage medium, and that the instructions, when executed, may include the steps of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or it may be a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time or in sequence; they may be performed at different times, and in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a synonym recognition device, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device is particularly applicable to various electronic devices.
As shown in fig. 3, the synonym recognition device according to the present embodiment includes: a data acquisition module 301, a search module 302, a first identification module 303, a labeling module 304, and a control module 305.
The data obtaining module 301 is configured to perform named entity recognition on a text to be subjected to synonym recognition to obtain a first string set, and read a plurality of entities in a preset knowledge base to form an entity set; the search module 302 is configured to sequentially search in at least one given data search engine using the first string set and the elements in the entity set as keywords, and generate a second string set according to a search result, where the second string set includes the identified strings; the first recognition module 303 is configured to generate a plurality of synonym pairs based on the first string set, the entity set, and the second string set, and link at least one string in the first string set to an entity in the entity set according to the synonym pairs, so as to complete synonym recognition of at least one string; the labeling module 304 is configured to generate a third string set according to the identified strings in the first string set and the second string set, perform supplementary recognition on the strings in the first string set based on the first string set, the third string set and the entity set, perform extraction of the strings to be labeled on the first string set after the supplementary recognition, and link the extracted strings to entities in the entity set based on a labeling result after labeling; the control module 305 is configured to update the third string set according to the character strings that are recognized and labeled in a complementary manner, and then make the labeling module 304 repeatedly perform the complementary recognition and labeling on the character strings that are not recognized in the first string set after the labeling until the recognition recall rate of the character strings in the first string set meets a preset condition.
In this embodiment, the text to be subjected to synonym recognition refers to a text including a plurality of character strings to be subjected to synonym recognition, and the data acquisition module 301 may extract the plurality of character strings to be subjected to synonym recognition from the text to be subjected to synonym recognition by using a named entity recognition method, so as to obtain a first character string set.
In some embodiments, when only some of the entities are read to form the entity set, the data obtaining module 301 is specifically configured to: determine whether character strings in the first character string set can be directly recognized in the preset knowledge base; if so, link the directly recognized character strings to the corresponding entities in the preset knowledge base; obtain, from the preset knowledge base, the entities directly associated with the linked entities; and generate the entity set from the linked entities and the directly associated entities. Forming the entity set from the linked entities and the entities directly associated with them can improve recognition accuracy, and also reduces the workload of subsequent retrieval and similarity calculation, which improves the efficiency of synonym recognition.
In a further embodiment, when reading a plurality of entities in the preset knowledge base to form the entity set, the data obtaining module 301 is further configured to: obtain a preset association level; obtain from the preset knowledge base, according to the association level, other entities that are associated with the directly associated entities, excluding the linked entities; take these other entities as indirectly associated entities of the linked entities; and add the indirectly associated entities to the entity set. This improves the effective coverage of the entity set. For details, refer to the method embodiments above, which are not repeated here.
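The construction of the entity set from linked, directly associated, and indirectly associated entities can be sketched as a breadth-first traversal. The representation of the knowledge base as an adjacency dictionary and the interpretation of the association level as a maximum hop count are assumptions made for illustration.

```python
from collections import deque


def build_entity_set(linked_entities, associations, level):
    """Collect linked entities plus entities within `level` association hops.

    `associations` maps an entity to the entities directly associated with it.
    level=1 adds only directly associated entities; higher levels also add
    indirectly associated entities.
    """
    entity_set = set(linked_entities)
    frontier = deque((entity, 0) for entity in linked_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == level:
            continue  # do not expand beyond the configured association level
        for neighbor in associations.get(entity, ()):
            if neighbor not in entity_set:
                entity_set.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return entity_set
```

Keeping the level small keeps the entity set focused, which is the trade-off the text describes between coverage and the cost of later retrieval and similarity calculation.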
In this embodiment, when searching with each keyword, the search module 302 obtains the search result for that keyword and extracts character strings from it, and then merges the extracted character strings to generate the second character string set. When extracting character strings from a search result, the search module 302 may use only the first several entries of the search result.
In this embodiment, the elements in the second character string set may include character strings in the first character string set and entities in the entity set, and may further include character strings that belong to neither the first character string set nor the entity set. Such character strings are referred to as unregistered character strings. Some unregistered character strings have synonyms in the entity set; that is, the second character string set includes recognized character strings.
For the first character string set, the entity set, and the second character string set, the first recognition module 303 uses a pattern-based method (Pattern-Based) to mine synonym pairs from the search results. For example, from text of the form "A, also called B" or "A, alias B", A and B may form a synonym pair. For details, refer to the method embodiments above, which are not repeated here.
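A pattern-based miner of this kind can be sketched with regular expressions. The two English patterns below mirror the "also called" and "alias" examples in the text; the exact patterns used in practice are not given in the source, so the pattern list and the function name are illustrative assumptions.

```python
import re

# Illustrative patterns only (assumption); a real system would use a
# larger, language-specific pattern list.
SYNONYM_PATTERNS = [
    re.compile(r"(?P<a>[\w ]+?), also called (?P<b>[\w ]+)"),
    re.compile(r"(?P<a>[\w ]+?), alias (?P<b>[\w ]+)"),
]


def mine_synonym_pairs(snippets):
    """Extract (A, B) synonym pairs from search result snippets."""
    pairs = set()
    for text in snippets:
        for pattern in SYNONYM_PATTERNS:
            for match in pattern.finditer(text):
                pairs.add((match.group("a").strip(), match.group("b").strip()))
    return pairs
```

A pair mined this way can then be used to link a string from the first set to an entity when one side of the pair is the string and the other side is the entity.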
In this embodiment, the elements in the third string set are all strings for which synonym recognition has been completed, that is, the elements in the third string set can be matched with corresponding synonyms in the entity set.
In some embodiments, when performing the supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set, and the entity set, the labeling module 304 is specifically configured to: calculate the similarity between the unrecognized character strings in the first character string set and the elements in the third character string set and the entity set, and, when a similarity is greater than a first preset threshold, link the corresponding character string in the first character string set to an entity in the entity set to complete the supplementary recognition. For details, refer to the method embodiments above, which are not repeated here.
When a similarity is greater than the first preset threshold and the corresponding character string in the first character string set is to be linked to an entity in the entity set, the labeling module 304 is specifically configured to: determine whether a plurality of similarities are greater than the first preset threshold and, if so, select the maximum similarity and link the corresponding character string in the first character string set to an entity in the entity set based on the maximum similarity. This yields higher recognition accuracy.
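The selection of the maximum similarity above the first preset threshold can be sketched as follows. The function name and the shape of `candidates` (a mapping from each comparable element, whether an entity or an already recognized string, to the entity it stands for) are assumptions made for illustration; the similarity function is passed in because the source does not name one.

```python
def link_by_max_similarity(string, candidates, sim, first_threshold):
    """Return the entity with the highest similarity above the threshold.

    `candidates` maps each comparable element to the entity it represents.
    Returns None when no similarity exceeds `first_threshold`.
    """
    best_entity = None
    best_score = first_threshold
    for element, entity in candidates.items():
        score = sim(string, element)
        # Strictly greater than the threshold and than the best score so far.
        if score > best_score:
            best_entity, best_score = entity, score
    return best_entity
```

A string for which this returns `None` stays unrecognized and remains a candidate for the later annotation step.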
In some embodiments, when extracting character strings to be annotated from the first character string set after the supplementary recognition and linking the extracted character strings to entities in the entity set based on the annotation result, the labeling module 304 is specifically configured to: select any unrecognized character string in the first character string set after the supplementary recognition as a current character string; calculate the similarity between the current character string and each of the other unrecognized character strings in the first character string set and each element in the entity set; if the number of similarities exceeding a second preset threshold reaches a preset value, output the current character string to a target receiving end for annotation; and link the current character string to an entity in the entity set based on the annotation result, wherein the second preset threshold is less than the first preset threshold. For details, refer to the method embodiments above, which are not repeated here.
In some embodiments, before calculating the similarity between the unrecognized character strings in the first character string set and the elements in the third character string set and the entity set, the labeling module 304 is further configured to: generate a plurality of character substrings based on all the character strings in the first character string set and the second character string set; obtain, based on the character substrings, the frequency of the common character substrings shared among the character strings; assign weights to the common character substrings whose frequency reaches a preset value; and regenerate all the character strings in the first character string set and the second character string set according to the assigned weights to obtain a new first character string set and a new second character string set. For details, refer to the method embodiments above, which are not repeated here.
In this embodiment, the control module 305 adds the recognized character strings to the third character string set of recognized character strings, and causes the labeling module 304 to repeat the supplementary recognition and to obtain more character strings that are difficult to recognize for annotation. This cycle continues until the recall rate reaches the required value. For example, if 90% of the character strings in the first character string set can be matched to a corresponding entity in the entity set as a synonym, the recall rate is 90%; if the preset recall rate is 90%, recognition stops when this value is reached and the recognition result is output.
In some embodiments, the control module 305 is further configured to update the preset knowledge base according to the recognition results of the strings in the first string set and the second string set, so as to add the recognized unregistered strings to the preset knowledge base to generate a new entity. The recall rate and the accuracy rate of the subsequent synonym recognition can be improved based on the updated knowledge base.
According to the synonym recognition device provided by the application, synonym recognition is performed through active learning: common character substrings with a large influence are screened out and a small amount of annotation is performed on them to improve the accuracy of the text similarity calculation, and the transitivity of synonymy is exploited. This greatly reduces the annotation workload while maintaining high accuracy, and reduces the influence of differences between spoken and written text on synonym recognition.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment. The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are communicatively connected to each other through a system bus, where computer readable instructions are stored in the memory 41, and the processor 42 implements the steps of the synonym identification method described in the above method embodiments when executing the computer readable instructions, and has advantages corresponding to the synonym identification method described above, which is not expanded herein.
It is noted that only a computer device 4 having a memory 41, a processor 42, and a network interface 43 is shown in the figure, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculations and/or information processing in accordance with predetermined or stored instructions, and that its hardware includes, but is not limited to, microprocessors, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
In the present embodiment, the memory 41 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used to store an operating system and various application software installed on the computer device 4, such as the computer readable instructions corresponding to the synonym recognition method described above. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, for example, execute computer readable instructions corresponding to the synonym identification method.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
The present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions that are executable by at least one processor, so that the at least one processor performs the steps of the synonym recognition method described above. The storage medium has the advantages corresponding to the synonym recognition method described above, which are not repeated here.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the application and do not limit the scope of the patent claims. This application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents may be substituted for some of their elements. Any equivalent structure made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the application.

Claims (9)

1. A method of synonym identification comprising the steps of:
carrying out named entity recognition on a text to be subjected to synonym recognition to obtain a first character string set, and reading a plurality of entities in a preset knowledge base to form an entity set;
sequentially searching in at least one given data search engine by taking the elements in the first character string set and the entity set as keywords, and generating a second character string set according to a search result, wherein the second character string set comprises recognized character strings;
generating a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and linking at least one character string in the first character string set to an entity in the entity set according to the synonym pairs to complete synonym identification of at least one character string;
generating a third character string set according to the character strings which are recognized in the first character string set and the second character string set, calculating the similarity between the unrecognized character strings in the first character string set and the elements in the third character string set and the entity set, when the similarity is larger than a first preset threshold value, linking the corresponding character strings in the first character string set to the entity in the entity set, completing supplementary recognition, selecting any unrecognized character string in the first character string set after the supplementary recognition as a current character string, calculating the similarity between the current character string and other unrecognized character strings in the first character string set and the elements in the entity set, outputting the current character string to a target receiving end for marking if the number of the similarities exceeding a second preset threshold value reaches a preset value, and linking the current character string to the entity in the entity set based on a marking result; wherein the second preset threshold is less than the first preset threshold;
updating the third character string set according to the character strings which have been supplementarily recognized and marked, and repeating the steps of supplementary recognition and marking on the character strings which remain unrecognized in the first character string set after marking, until the recognition recall rate of the character strings in the first character string set meets the preset condition.
2. The synonym identification method of claim 1, wherein prior to the step of calculating the similarity of unrecognized strings in the first set of strings to elements in the third set of strings and the set of entities, the method further comprises:
generating a plurality of character sub-strings based on all the character strings in the first character string set and the second character string set, acquiring the frequency of the common character sub-strings among all the character strings based on the character sub-strings, carrying out weight marking on the common character sub-strings with the frequency reaching a preset value, and regenerating all the character strings in the first character string set and the second character string set according to the marked weights to obtain a new first character string set and a new second character string set.
3. The synonym identification method of claim 1 or claim 2, wherein when there is a similarity greater than a first preset threshold, the step of linking the corresponding string in the first string set to an entity in the entity set comprises:
judging whether a plurality of similarities are larger than the first preset threshold, if so, selecting the maximum similarity, and linking the corresponding character string in the first character string set to the entity in the entity set based on the maximum similarity.
4. The synonym recognition method of claim 1 or 2, wherein after the step of until the recognition recall of the strings in the first set of strings reaches a preset value, the method further comprises:
and updating the preset knowledge base according to the recognition results of the character strings in the first character string set and the second character string set so as to add the recognized unregistered character strings into the preset knowledge base to generate new entities.
5. The synonym identification method of claim 1 or 2, wherein the step of reading a number of entities in a pre-set knowledge base to form a set of entities comprises:
judging whether the character strings in the first character string set can be directly identified in the preset knowledge base, if so, linking the directly identified character strings to corresponding entities in the preset knowledge base, acquiring entities directly related to the linked entities from the preset knowledge base, and generating the entity set according to the linked entities and the directly related entities.
6. The synonym identification method of claim 5, wherein the step of reading a plurality of entities in a pre-set knowledge base to form a set of entities further comprises:
obtaining a preset association level, obtaining other entities which are associated with the directly associated entity except the linked entity from the preset knowledge base according to the association level, taking the other entities as indirectly associated entities of the linked entity, and adding the indirectly associated entities to the entity set.
7. A synonym identification device, comprising:
the data acquisition module is used for carrying out named entity recognition on texts to be subjected to synonym recognition to obtain a first character string set, and reading a plurality of entities in a preset knowledge base to form an entity set;
the search module is used for sequentially searching in at least one given data search engine by taking the elements in the first character string set and the entity set as keywords, and generating a second character string set according to a search result, wherein the second character string set comprises the identified character strings;
the first recognition module is used for generating a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and linking at least one character string in the first character string set to an entity in the entity set according to the synonym pairs to complete synonym recognition of at least one character string;
the labeling module is used for generating a third character string set according to the character strings which are recognized in the first character string set and the second character string set, calculating the similarity between the unrecognized character strings in the first character string set and the elements in the third character string set and the entity set, when the similarity is larger than a first preset threshold value, linking the corresponding character strings in the first character string set to the entity in the entity set, completing the supplementary recognition, selecting any unrecognized character string in the first character string set after the supplementary recognition as a current character string, calculating the similarity between the current character string and other unrecognized character strings in the first character string set and the elements in the entity set, outputting the current character string to a target receiving end for labeling if the number of the similarities exceeding a second preset threshold value reaches a preset value, and linking the current character string to the entity in the entity set based on a labeling result; wherein the second preset threshold is less than the first preset threshold;
and the control module is used for updating the third character string set according to the character strings which are recognized and marked in a supplementing way, and then enabling the marking module to repeatedly execute the supplementing recognition and marking on the character strings which are not recognized in the first character string set after the marking until the recognition recall rate of the character strings in the first character string set meets the preset condition.
8. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the synonym identification method as claimed in any one of claims 1 to 6.
9. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the synonym identification method as claimed in any one of claims 1 to 6.
CN202110479989.9A 2021-04-30 2021-04-30 Synonym recognition method, synonym recognition device, computer equipment and storage medium Active CN113051900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110479989.9A CN113051900B (en) 2021-04-30 2021-04-30 Synonym recognition method, synonym recognition device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110479989.9A CN113051900B (en) 2021-04-30 2021-04-30 Synonym recognition method, synonym recognition device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113051900A CN113051900A (en) 2021-06-29
CN113051900B true CN113051900B (en) 2023-08-22

Family

ID=76517864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110479989.9A Active CN113051900B (en) 2021-04-30 2021-04-30 Synonym recognition method, synonym recognition device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113051900B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720520B (en) * 2023-08-07 2023-11-03 烟台云朵软件有限公司 Text data-oriented alias entity rapid identification method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202382A (en) * 2016-07-08 2016-12-07 南京缘长信息科技有限公司 Link instance method and system
CN108491373A (en) * 2018-02-01 2018-09-04 北京百度网讯科技有限公司 A kind of entity recognition method and system
CN110633464A (en) * 2018-06-22 2019-12-31 北京京东尚科信息技术有限公司 Semantic recognition method, device, medium and electronic equipment
CN110825827A (en) * 2019-11-13 2020-02-21 北京明略软件系统有限公司 Entity relationship recognition model training method and device and entity relationship recognition method and device
CN110969005A (en) * 2018-09-29 2020-04-07 航天信息股份有限公司 Method and device for determining similarity between entity corpora

Also Published As

Publication number Publication date
CN113051900A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
CN110427453B (en) Data similarity calculation method, device, computer equipment and storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN112287069A (en) Information retrieval method and device based on voice semantics and computer equipment
CN113987125A (en) Text structured information extraction method based on neural network and related equipment thereof
CN113407785A (en) Data processing method and system based on distributed storage system
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN112686053A (en) Data enhancement method and device, computer equipment and storage medium
CN112446209A (en) Method, equipment and device for setting intention label and storage medium
CN112395391A (en) Concept graph construction method and device, computer equipment and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
CN113051900B (en) Synonym recognition method, synonym recognition device, computer equipment and storage medium
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN117312535A (en) Method, device, equipment and medium for processing problem data based on artificial intelligence
CN111639164A (en) Question-answer matching method and device of question-answer system, computer equipment and storage medium
CN112182157A (en) Training method of online sequence labeling model, online labeling method and related equipment
CN114742058B (en) Named entity extraction method, named entity extraction device, computer equipment and storage medium
CN115730603A (en) Information extraction method, device, equipment and storage medium based on artificial intelligence
CN114637831A (en) Data query method based on semantic analysis and related equipment thereof
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
CN112949320A (en) Sequence labeling method, device, equipment and medium based on conditional random field
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN113688268B (en) Picture information extraction method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant