CN113051900A - Synonym recognition method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113051900A
Authority
CN
China
Prior art keywords
character string
string set
character
entity
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110479989.9A
Other languages
Chinese (zh)
Other versions
CN113051900B (en
Inventor
陈岳峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110479989.9A priority Critical patent/CN113051900B/en
Publication of CN113051900A publication Critical patent/CN113051900A/en
Application granted granted Critical
Publication of CN113051900B publication Critical patent/CN113051900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method comprises: searching with the elements of a first character string set, obtained from a text to be subjected to synonym recognition, and of an entity set formed from a preset knowledge base, and generating a second character string set containing recognized character strings; generating a number of synonym pairs based on the three sets so as to link at least one character string in the first character string set to an entity; generating a third character string set from the recognized character strings; performing supplementary recognition and labeling on the character strings in the first character string set based on the first character string set, the third character string set and the entity set; and, after the third character string set is updated with the character strings that have undergone supplementary recognition and labeling, repeating the supplementary recognition and labeling on the first character string set until the recognition recall rate meets a preset condition. The application also relates to blockchain technology, and private information in the text can be stored in a blockchain. Based on active learning, the application improves recognition accuracy through the transferability of synonyms.

Description

Synonym recognition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a synonym identification method, apparatus, computer device, and storage medium.
Background
Synonym recognition is an important and fundamental problem in NLP (Natural Language Processing), with important applications in knowledge graphs and Question Answering (QA) systems; in a QA system, for example, synonym recognition greatly affects answer accuracy and coverage. Synonym recognition consists of finding, in a knowledge base, the entity that is synonymous with an entity to be recognized. Calling the entity to be recognized the Mention and an entity in the knowledge base the Entity, the main current solutions for identifying the Entity synonymous with a Mention are:
(1) Relying on an existing manually edited knowledge base, such as the Harbin Institute of Technology extended edition of the Tongyici Cilin synonym thesaurus or HowNet. This method is simple and achieves high accuracy, but its coverage is low in a specific vertical field.
(2) Relying on the contextual relatedness of texts, for example with unsupervised Word2Vec or weakly supervised DPE: the similarity between each entity to be recognized and the entities in the knowledge base is calculated, and the Top-K most similar knowledge-base entities are obtained by ranking.
(3) Directly calculating the text similarity between the Mention and the Entity and judging from it whether the two are synonyms. This method is simple to compute and needs no large-scale corpus, but it tends to mine many wrong synonyms.
Some products perform synonym recognition for a vertical field. Because the field restriction makes a large-scale corpus difficult to obtain, and because in many scenarios the Mention lacks useful context in the corpus (for example, user queries in QA are generally short, so the Mention carries little context information), effective synonym recognition is difficult to achieve with solutions (1) and (2). In solution (3), the text similarity calculation is limited by the screening threshold: recall is low if the threshold is set high, and accuracy is low if it is set low, so synonym predictions are often wrong. For example, in vertical fields such as the medical profession, the Mention and the Entity are mostly phrases composed of several words, the Mention is often spoken language, and the Entity in the knowledge base is written language. Text similarity calculated directly on such phrases is therefore not accurate enough, because of the difference between spoken and written language, and this affects the final recognition accuracy.
Disclosure of Invention
An object of the embodiments of the present application is to provide a synonym identification method, apparatus, computer device, and storage medium, so as to solve the problems in the prior art that effective synonym identification is difficult to achieve under the condition of less linguistic data, and that the synonym identification accuracy is low when text similarity is directly calculated in the vertical field.
In order to solve the above technical problem, an embodiment of the present application provides a synonym identification method, which adopts the following technical solutions:
a synonym recognition method, comprising the steps of:
carrying out named entity recognition on a text to be subjected to synonym recognition to obtain a first character string set, and reading a plurality of entities in a preset knowledge base to form an entity set;
taking the elements in the first character string set and the entity set as key words, sequentially searching in at least one given data search engine, and generating a second character string set according to a search result, wherein the second character string set comprises recognized character strings;
generating a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and linking at least one character string in the first character string set to an entity in the entity set according to the synonym pairs to complete synonym identification of at least one character string;
generating a third character string set according to the recognized character strings in the first character string set and the second character string set, performing supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set, extracting character strings to be labeled from the first character string set after the supplementary recognition, and linking the extracted character strings to the entities in the entity set based on a labeling result after the labeling;
and updating the third character string set according to the character strings subjected to supplementary recognition and marking, and repeating the steps of supplementary recognition and marking on the character strings which are not recognized in the first character string set after marking until the recognition recall rate of the character strings in the first character string set meets the preset condition.
In order to solve the above technical problem, an embodiment of the present application further provides a synonym identification device, which adopts the following technical solutions:
a synonym recognition device, comprising:
the data acquisition module is used for carrying out named entity recognition on a text to be subjected to synonym recognition to obtain a first character string set and reading a plurality of entities in a preset knowledge base to form an entity set;
a search module, configured to search sequentially in at least one given data search engine using the elements in the first character string set and the entity set as keywords, and generate a second character string set according to a search result, where the second character string set includes identified character strings;
the first identification module is used for generating a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and linking at least one character string in the first character string set to an entity in the entity set according to the synonym pairs to complete synonym identification of at least one character string;
the labeling module is used for generating a third character string set according to the recognized character strings in the first character string set and the second character string set, performing supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set, extracting character strings to be labeled from the first character string set after the supplementary recognition, and linking the extracted character strings to the entities in the entity set based on a labeling result after labeling;
and the control module is used for updating the third character string set according to the character strings subjected to supplementary recognition and marking, and then enabling the marking module to repeatedly execute supplementary recognition and marking on the character strings which are not recognized in the first character string set after marking until the recognition recall rate of the character strings in the first character string set meets a preset condition.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
a computer device comprising a memory having computer readable instructions stored therein and a processor that, when executing the computer readable instructions, implements the steps of the synonym recognition method described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of a synonym recognition method as described above.
Compared with the prior art, the synonym identification method, the synonym identification device, the computer equipment and the storage medium provided by the embodiment of the application have the following main beneficial effects:
the recognition accuracy is improved through the transferability of synonyms, and the influence of the difference between spoken and written language is reduced; meanwhile, the character strings to be labeled are selected by active learning, which greatly reduces the recognition workload.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for the description of the embodiments of the present application will be briefly described below, and the drawings in the following description correspond to some embodiments of the present application, and it will be obvious to those skilled in the art that other drawings can be obtained from the drawings without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a synonym identification method according to the present application;
FIG. 3 is a schematic diagram of a synonym identification device according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and in the claims of the present application or in the drawings described above, are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the synonym identification method provided in the embodiment of the present application is generally executed by a server, and accordingly, the synonym identification apparatus is generally disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a synonym identification method according to the present application is shown. The synonym identification method comprises the following steps:
S201, performing named entity recognition on a text to be subjected to synonym recognition to obtain a first character string set, and reading a plurality of entities in a preset knowledge base to form an entity set;
S202, taking the elements in the first character string set and the entity set as keywords, sequentially searching in at least one given data search engine, and generating a second character string set according to the search results, wherein the second character string set comprises recognized character strings;
S203, generating a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and linking at least one character string in the first character string set to an entity in the entity set according to the synonym pairs to complete synonym recognition of the at least one character string;
S204, generating a third character string set according to the recognized character strings in the first character string set and the second character string set, performing supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set, extracting character strings to be labeled from the first character string set after the supplementary recognition, and linking the extracted character strings to the entities in the entity set based on the labeling result after labeling;
S205, updating the third character string set according to the character strings subjected to supplementary recognition and labeling, and repeating the steps of supplementary recognition and labeling on the character strings that remain unrecognized in the first character string set after labeling, until the recognition recall rate of the character strings in the first character string set meets the preset condition.
The above steps are explained in the following.
The embodiment of the application is applied to a vertical field with little corpus data; a vertical field is one that focuses on only a certain part of an industry, for example diseases within the medical industry.
For step S201, in this embodiment, the text to be subjected to synonym recognition is a text including a plurality of character strings to be subjected to synonym recognition, for example, in the QA system, after receiving a text submitted by a user (such as a question or an operation instruction, etc.), the QA system needs to identify an intention of the question to generate a targeted reply text to be fed back to the user, where the text submitted by the user is the text to be subjected to synonym recognition, and a plurality of character strings to be subjected to synonym recognition can be extracted from the text to be subjected to synonym recognition by means of named entity recognition, so as to obtain the first character string set.
In knowledge engineering, a knowledge base is a structured, comprehensive and organized collection of knowledge that is easy to operate and use. It is a set of interconnected pieces of knowledge that are stored, organized, managed and used in computer memory in one or more knowledge representation modes, according to the needs of solving problems in one or more fields. The knowledge in a knowledge base is generally represented as entities; that is, the knowledge base comprises a plurality of entities. In a vertical field, the entity coverage of the knowledge base is small. When a plurality of entities in the preset knowledge base are read to form the entity set, either all of the entities or only some of them may be read.
In some embodiments, where only some of the entities are read to form the entity set, the step of reading a plurality of entities in the preset knowledge base to form the entity set includes: judging whether character strings in the first character string set can be directly recognized in the preset knowledge base; if so, linking the directly recognized character strings to the corresponding entities in the preset knowledge base, acquiring from the preset knowledge base the entities directly associated with the linked entities, and generating the entity set from the linked entities and the directly associated entities. Generating the entity set from the linked entities and the directly associated entities improves recognition accuracy, reduces the workload of subsequent retrieval and similarity calculation, and improves the efficiency of synonym recognition.
In a further embodiment, the step of reading a plurality of entities in the preset knowledge base to form the entity set further includes: acquiring a preset number of association levels (association series); acquiring from the preset knowledge base, according to that number of levels, the other entities, apart from the linked entities, that are associated with the directly associated entities; taking these as indirectly associated entities of the linked entities; and adding them to the entity set. This step improves the effective entity coverage of the entity set, so that the entity set contains synonyms of as many character strings in the first character string set as possible. A directly associated entity can be associated with entities other than the linked entities, and still more associated entities can be reached through the indirectly associated entities. That is, by setting the number of association levels, entities that are indirectly associated over multiple levels can be obtained starting from the directly associated entities and added to the entity set. For example, the number of association levels is set to two.
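The expansion from linked entities to directly and indirectly associated entities can be sketched as a bounded breadth-first traversal. This is only an illustration under assumptions: the knowledge base is modeled as a hypothetical adjacency dictionary `kb_links`, the function name `build_entity_set` and the sample entities are invented for the example, and `levels=1` is taken to mean direct associations only, which the text does not state explicitly.

```python
from collections import deque

def build_entity_set(kb_links, linked_entities, levels=1):
    """Collect the linked entities plus every entity within `levels` hops.

    kb_links is an assumed adjacency dict: entity -> associated entities.
    """
    selected = set(linked_entities)
    frontier = deque((entity, 0) for entity in linked_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == levels:
            continue  # do not expand beyond the preset association level
        for neighbor in kb_links.get(entity, ()):
            if neighbor not in selected:
                selected.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return selected

# Made-up knowledge-base fragment, for illustration only.
kb_links = {
    "hypertension": ["blood pressure", "cardiology"],
    "blood pressure": ["sphygmomanometer"],
    "cardiology": ["heart disease"],
    "heart disease": ["angina"],
}
direct = build_entity_set(kb_links, ["hypertension"], levels=1)
two_level = build_entity_set(kb_links, ["hypertension"], levels=2)
```

With one level only the direct associations are added; with two levels the entities associated with those are added as well, while anything three hops away is left out.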
For step S202, in this embodiment, each keyword is searched separately, character strings are extracted from the search results of each keyword, and the extracted character strings are combined to generate the second character string set. When extracting character strings, only the first several search result items may be used. For example, the character strings in the first character string set and the entities in the entity set are each used as keywords to search a commercial search engine, and the Top-K web pages are selected for character string extraction.
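A minimal sketch of step S202 follows, assuming the search engine and the string extractor are supplied as callables. The names `build_second_set`, `fake_search`, `split_on_punctuation` and the in-memory `FAKE_INDEX` are hypothetical stand-ins; a real deployment would query an actual search engine and would likely use a more careful extractor.

```python
import re

def build_second_set(keywords, search, extract, top_k=2):
    """Search each keyword, keep the top_k results, and pool the extracted strings."""
    second_set = set()
    for keyword in keywords:
        for page in search(keyword)[:top_k]:
            second_set.update(extract(page))
    return second_set

# Stand-in for a search engine: keyword -> ranked page texts (invented data).
FAKE_INDEX = {
    "high blood pressure": [
        "hypertension, high blood pressure",
        "high blood pressure; HBP",
        "unrelated page that ranks third",
    ],
}

def fake_search(keyword):
    return FAKE_INDEX.get(keyword, [])

def split_on_punctuation(page):
    # Naive extractor (an assumption): treat comma/semicolon-separated parts as strings.
    return {part.strip() for part in re.split(r"[,;]", page) if part.strip()}

second_set = build_second_set(
    ["high blood pressure", "hypertension"], fake_search, split_on_punctuation
)
```

Passing the search function in as a parameter keeps the Top-K truncation and the merging logic testable without network access.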
In this embodiment, the elements of the second character string set may include character strings in the first character string set and entities in the entity set, and may further include character strings that belong to neither set. Such character strings are called unregistered (out-of-vocabulary) character strings. Some of them have synonyms in the entity set; that is, the second character string set includes recognized character strings.
For step S203, in this embodiment, initial synonym pairs are mined from the first character string set, the entity set and the second character string set. The scheme mines synonym pairs by applying a pattern-based matching method to the search results: for example, from "A, also called B" or "A, alias B", A and B can form a synonym pair.
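Pattern-based mining of this kind can be sketched with regular expressions. The two English patterns below are stand-ins for whatever patterns the original Chinese text would use, and the function name `mine_synonym_pairs` and the sample snippets are assumptions made for illustration.

```python
import re

# Illustrative patterns only; a real system would use patterns suited to its language.
PATTERNS = [
    re.compile(r"(?P<a>[\w ]+?), also called (?P<b>[\w ]+)"),
    re.compile(r"(?P<a>[\w ]+?), alias (?P<b>[\w ]+)"),
]

def mine_synonym_pairs(snippets):
    """Return the set of (A, B) pairs matched by any pattern in any snippet."""
    pairs = set()
    for text in snippets:
        for pattern in PATTERNS:
            for match in pattern.finditer(text):
                pairs.add((match.group("a").strip(), match.group("b").strip()))
    return pairs

snippets = [
    "hypertension, also called high blood pressure.",
    "myocardial infarction, alias heart attack.",
]
pairs = mine_synonym_pairs(snippets)
```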
A synonym pair may consist of two elements from the same set among the first character string set, the entity set and the second character string set, or of two elements from different sets. That is, a synonym pair generated in this embodiment may consist of a character string in the first character string set and an entity in the entity set, of two character strings in the first character string set, or of a character string in the first character string set and an unregistered character string. Based on these synonym pairs, some character strings in the first character string set can be linked directly to an entity in the entity set, which completes their recognition; others can reach an entity in the entity set by jumping through a character string in the second character string set. For example, if the character string C1 in the first character string set and the character string C2 in the second character string set form a synonym pair, and C2 and the entity S in the entity set form a synonym pair, then C1 can be linked to the entity S, which completes its recognition.
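The C1, C2, S example relies on synonymy being transitive. One way to sketch this, assuming synonym pairs are treated as undirected edges of a graph, is a breadth-first search from each character string until an entity is reached. The function name `link_via_synonyms` is hypothetical, and the source does not prescribe this particular traversal.

```python
from collections import defaultdict, deque

def link_via_synonyms(first_set, entity_set, synonym_pairs):
    """Link each string to the first entity reachable through synonym pairs."""
    graph = defaultdict(set)
    for a, b in synonym_pairs:
        graph[a].add(b)
        graph[b].add(a)

    links = {}
    for string in first_set:
        seen, queue = {string}, deque([string])
        while queue:
            current = queue.popleft()
            if current in entity_set:
                links[string] = current
                break
            for neighbor in sorted(graph[current]):  # sorted for determinism
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return links

links = link_via_synonyms(
    first_set=["C1", "C3"],
    entity_set={"S"},
    synonym_pairs=[("C1", "C2"), ("C2", "S")],
)
```

Here C1 reaches S through C2, while C3 takes part in no synonym pair and stays unrecognized.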
For step S204, in this embodiment, all the elements in the third character string set are character strings that have completed synonym recognition, that is, all the elements therein can be matched to corresponding synonyms in the entity set.
In some embodiments, the step of performing supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set comprises: calculating the similarity between the unrecognized character strings in the first character string set and the elements of the third character string set and the entity set, and, when a similarity is greater than a first preset threshold, linking the corresponding character string in the first character string set to an entity in the entity set, thereby completing the supplementary recognition. Specifically, for each unrecognized character string, its similarity to every entity and to every recognized character string can be calculated with measures such as the Jaccard distance or the edit distance. If the similarity to an entity or a recognized character string exceeds the first preset threshold, that entity or recognized character string is considered a synonym of the unrecognized character string, which can then be linked directly or indirectly to an entity in the entity set. To ensure accuracy, the first preset threshold is set to a relatively large value.
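The two measures named above can be sketched as follows. The choices made here are assumptions: Jaccard similarity is computed over character sets (one minus the Jaccard distance), and the edit distance is normalized by the longer string's length to give a similarity in the range 0 to 1. The source does not specify either detail.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of the character sets of a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # substitution
            ))
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest
```

Character-level sets are used because the original targets Chinese text, where characters are a natural unit; word-level tokens would be an equally plausible choice for other languages.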
Wherein, when the similarity is greater than a first preset threshold, the step of linking the corresponding character string in the first character string set to the entity in the entity set includes: and judging whether a plurality of similarity degrees are larger than the first preset threshold value, if so, selecting the maximum similarity degree, and linking the corresponding character strings in the first character string set to the entity in the entity set based on the maximum similarity degree. In this step, after comparing the similarity with the first preset threshold, if a plurality of similarities exceed the first preset threshold, the synonym recognition is performed by selecting the highest similarity, so that a higher recognition accuracy can be obtained.
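Selecting the highest similarity among those above the first preset threshold might look like the sketch below. The character-set Jaccard measure, the function name `best_match`, the threshold of 0.8 and the sample strings are all illustrative assumptions.

```python
def char_jaccard(a, b):
    """Character-set Jaccard similarity (an assumed choice of measure)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def best_match(candidate, targets, threshold, similarity=char_jaccard):
    """Return the target with the highest similarity above threshold, else None."""
    above = [
        (similarity(candidate, target), target)
        for target in targets
        if similarity(candidate, target) > threshold
    ]
    if not above:
        return None
    return max(above, key=lambda pair: pair[0])[1]

match = best_match("heart attack", ["heart attacks", "heart ache", "stroke"], 0.8)
no_match = best_match("heart attack", ["stroke"], 0.8)
```

In this example two targets exceed the threshold, and the one with the larger score is chosen.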
In some embodiments, the step of extracting the character strings to be labeled from the first character string set after the supplementary recognition, and linking the extracted character strings to the entities in the entity set based on the labeling result after labeling, includes: selecting any unrecognized character string in the first character string set after the supplementary recognition as the current character string; calculating the similarity between the current character string and the other unrecognized character strings in the first character string set and the elements of the entity set; if the number of similarities exceeding a second preset threshold reaches a preset value, outputting the current character string to a target receiving end for labeling; and linking the current character string to an entity in the entity set based on the labeling result; wherein the second preset threshold is smaller than the first preset threshold. After the supplementary recognition, the character strings that remain unrecognized in the first character string set are harder to recognize. Their similarity to one another and to the elements of the entity set is therefore calculated and compared with the second preset threshold, which is smaller than the first, to obtain the similarities that meet the condition. Each such similarity corresponds to one character string or entity to which the current character string could be linked. If the number of these candidates reaches the preset value, the character string is hard to recognize and needs auxiliary labeling for synonym recognition. Screening the character strings to be labeled in this active learning process quickly identifies the character strings that should be labeled first, which improves labeling and recognition efficiency.
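The screening rule can be sketched as a simple count: a character string is sent for labeling when enough other strings or entities exceed the second threshold. The measure, the function name `select_for_labeling`, the threshold of 0.6 and the preset count of 2 are assumptions chosen for the example.

```python
def char_jaccard(a, b):
    """Character-set Jaccard similarity (an assumed choice of measure)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def select_for_labeling(unrecognized, entity_set, second_threshold, min_count,
                        similarity=char_jaccard):
    """Pick strings with at least min_count neighbors above second_threshold."""
    to_label = []
    for current in unrecognized:
        others = [s for s in unrecognized if s != current] + list(entity_set)
        hits = sum(1 for other in others
                   if similarity(current, other) > second_threshold)
        if hits >= min_count:
            to_label.append(current)
    return to_label

to_label = select_for_labeling(
    unrecognized=["heart attack", "heart ache", "xyz"],
    entity_set=["heart attacks"],
    second_threshold=0.6,
    min_count=2,
)
```

Strings with several plausible candidates are the ambiguous ones, so a human label on them resolves the most uncertainty per annotation.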
In some embodiments, prior to the step of calculating the similarity between the unrecognized character strings in the first character string set and the elements in the third character string set and the entity set, the method further comprises: generating a plurality of character substrings based on all the character strings in the first character string set and the second character string set; obtaining, based on the character substrings, the frequency of the common character substrings shared among the character strings; marking weights for the common character substrings whose frequency reaches a preset value; and regenerating all the character strings in the first character string set and the second character string set according to the marked weights to obtain a new first character string set and a new second character string set. Similarity is subsequently calculated based on the new first and second character string sets, and marking some of the high-frequency common character substrings improves the accuracy of the similarity calculation. The marked weight may be 0 or 1. For example, in recognizing synonyms in the occupation category, "work" is a high-frequency common character substring that has little influence on whether a character string in the first character string set and an entity in the entity set are synonyms, so its weight can be marked as 0. "UI design work" can then be regenerated as "UI design", "programmer work" as "programmer", and so on, which is equivalent to treating "UI design work" and "UI design", and "programmer work" and "programmer", as synonyms.
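The weight marking of high-frequency common substrings can be sketched as below. Two simplifications are assumed for illustration: whitespace-separated tokens stand in for the character substrings, and every token that reaches the frequency threshold is treated as having been marked with weight 0 (in the application, the weight of each frequent substring is decided by marking). The function names are hypothetical.

```python
from collections import Counter


def find_common_substrings(strings, min_frequency):
    """Return tokens that occur in at least min_frequency of the strings."""
    counts = Counter()
    for s in strings:
        counts.update(set(s.split()))  # count each token once per string
    return {token for token, n in counts.items() if n >= min_frequency}


def regenerate(strings, zero_weight):
    """Rebuild each string, dropping the tokens whose marked weight is 0."""
    result = []
    for s in strings:
        kept = [t for t in s.split() if t not in zero_weight]
        # Keep the original string if every token would be removed.
        result.append(" ".join(kept) if kept else s)
    return result


strings = ["UI design work", "programmer work", "teacher work", "UI design"]
common = find_common_substrings(strings, min_frequency=3)
new_strings = regenerate(strings, zero_weight=common)
```

With these inputs only "work" reaches the frequency threshold, so "UI design work" and "UI design" become identical before similarity is calculated.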
For step S205, the character strings recognized in the previous step are added to the recognized third character string set, and the supplementary recognition is repeated; meanwhile, more character strings that are difficult to recognize are obtained for labeling. This process is repeated until the recall rate reaches the required value. For example, if 90% of the character strings in the first character string set can be matched to a corresponding entity in the entity set as a synonym, the recall rate is 90%; if the preset value of the recall rate is 90%, recognition stops and the recognition result is output once this value is reached.
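The iteration in step S205 can be sketched as a loop that stops once the recall rate reaches the preset value. The sketch assumes the supplementary recognition and the labeling are supplied as callables (`supplement` and `label`, both hypothetical names) that return a mapping from newly linked character strings to entities; the `max_rounds` guard is also an addition for safety and is not part of the application.

```python
def run_until_recall(first_set, recognized, supplement, label,
                     target_recall=0.9, max_rounds=100):
    """Repeat supplementary recognition and labeling until recall is reached.

    recognized maps already recognized strings to entities (the third set).
    supplement and label each take (pending_strings, third_set) and return
    a dict of newly linked strings -> entities.
    """
    if not first_set:
        return {}, 1.0
    third_set = dict(recognized)
    linked = {s: e for s, e in recognized.items() if s in first_set}
    for _ in range(max_rounds):
        if len(linked) / len(first_set) >= target_recall:
            break
        pending = [s for s in first_set if s not in linked]
        new_links = dict(supplement(pending, third_set))
        still_pending = [s for s in pending if s not in new_links]
        new_links.update(label(still_pending, third_set))
        if not new_links:
            break  # no progress is possible; stop instead of looping forever
        linked.update(new_links)
        third_set.update(new_links)  # update the third set for the next round
    return linked, len(linked) / len(first_set)


# Toy stand-ins: each round recognizes two strings and labels one more.
first_set = [f"s{i}" for i in range(10)]
links, recall = run_until_recall(
    first_set,
    recognized={"s0": "E0"},
    supplement=lambda pending, third: {s: "E" for s in pending[:2]},
    label=lambda pending, third: {s: "E" for s in pending[:1]},
)
```

In the toy run the recall rate rises from 10% to 40%, 70%, and then 100%, at which point the loop stops.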
In some embodiments, after the step of repeating until the recognition recall rate of the character strings in the first character string set reaches a preset value, the method further comprises: updating the preset knowledge base according to the recognition results of the character strings in the first character string set and the second character string set, so as to add the recognized unregistered character strings to the preset knowledge base and generate new entities. The recall rate and accuracy of subsequent synonym recognition can be improved based on the updated knowledge base.
The synonym recognition method provided by the present application performs synonym recognition in an active-learning-based manner. By screening out the common character substrings that have a large influence and applying a small amount of labeling, it improves the accuracy of the text similarity calculation; by exploiting the transitivity of synonyms, it greatly reduces the labeling workload while maintaining high accuracy, and it reduces the influence of the differences between spoken and written text on synonym recognition.
It should be emphasized that, to further ensure the privacy and security of the information, the private information in the text to be subjected to synonym recognition may also be stored in a node of a blockchain.
The blockchain referred to in the present application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by computer readable instructions instructing the associated hardware. The instructions can be stored in a computer readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM) or the like.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a synonym recognition apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 3, the synonym recognition device according to this embodiment includes: a data acquisition module 301, a search module 302, a first identification module 303, a labeling module 304, and a control module 305.
The data acquisition module 301 is configured to perform named entity recognition on a text to be subjected to synonym recognition to obtain a first character string set, and read a plurality of entities in a preset knowledge base to form an entity set; the search module 302 is configured to sequentially perform search in at least one given data search engine by using the elements in the first character string set and the entity set as keywords, and generate a second character string set according to a search result, where the second character string set includes recognized character strings; the first identification module 303 is configured to generate a plurality of synonym pairs based on the first character string set, the entity set, and the second character string set, and link at least one character string in the first character string set to an entity in the entity set according to the synonym pairs, thereby completing synonym identification of at least one character string; the labeling module 304 is configured to generate a third character string set according to the recognized character strings in the first character string set and the second character string set, perform supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set, and the entity set, extract character strings to be labeled from the first character string set after the supplementary recognition, and link the extracted character strings to the entities in the entity set based on a labeling result after the labeling; the control module 305 is configured to update the third character string set according to the character strings subjected to supplementary recognition and labeling, and then enable the labeling module 304 to repeatedly perform supplementary recognition and labeling on the unrecognized character strings in the first character string set after labeling until the recognition recall rate of the character strings in the first character string set meets a preset condition.
In this embodiment, the text to be subjected to synonym recognition refers to a text including a plurality of character strings to be subjected to synonym recognition, and the data acquisition module 301 may extract the plurality of character strings to be subjected to synonym recognition from the text to be subjected to synonym recognition in a named entity recognition manner, so as to obtain the first character string set.
In some embodiments, for the case where only a part of the entities are read to form the entity set, when the data acquisition module 301 reads a plurality of entities in the preset knowledge base to form the entity set, it is specifically configured to: determine whether the character strings in the first character string set can be directly recognized in the preset knowledge base; if so, link the directly recognized character strings to the corresponding entities in the preset knowledge base, obtain the entities directly associated with the linked entities from the preset knowledge base, and generate the entity set from the linked entities and the directly associated entities. Generating the entity set from the linked entities and the directly associated entities improves recognition accuracy, reduces the workload of subsequent retrieval and similarity calculation, and improves the efficiency of synonym recognition.
In a further embodiment, when the data acquisition module 301 reads a plurality of entities in the preset knowledge base to form the entity set, it is further configured to: obtain a preset association series (that is, a preset number of association levels); obtain from the preset knowledge base, according to the association series, the other entities, apart from the linked entities, that are associated with the directly associated entities; take these other entities as indirectly associated entities of the linked entities; and add the indirectly associated entities to the entity set. This improves the effective coverage of the entity set. For details, reference may be made to the above method embodiments, which are not repeated here.
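Building the entity set from linked, directly associated, and indirectly associated entities can be sketched as a breadth-first traversal of the knowledge base. The sketch reads the preset association series as a maximum number of hops from a linked entity and represents the knowledge base as an adjacency mapping; both readings, and the names `build_entity_set` and `graph`, are assumptions made for illustration.

```python
from collections import deque


def build_entity_set(linked_entities, graph, max_degree):
    """Collect linked entities plus entities within max_degree hops of them.

    graph maps an entity to the entities it is directly associated with.
    max_degree=1 yields the linked and directly associated entities only;
    larger values also add indirectly associated entities.
    """
    entity_set = set(linked_entities)
    queue = deque((entity, 0) for entity in linked_entities)
    while queue:
        entity, depth = queue.popleft()
        if depth == max_degree:
            continue  # do not expand beyond the preset association series
        for neighbor in graph.get(entity, ()):
            if neighbor not in entity_set:
                entity_set.add(neighbor)
                queue.append((neighbor, depth + 1))
    return entity_set


graph = {
    "programmer": ["software engineer"],
    "software engineer": ["architect"],
    "architect": ["CTO"],
}
direct_only = build_entity_set(["programmer"], graph, max_degree=1)
with_indirect = build_entity_set(["programmer"], graph, max_degree=2)
```

Raising `max_degree` widens the coverage of the entity set at the cost of more retrieval and similarity calculation later on.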
In this embodiment, when the search module 302 searches with each keyword, it obtains the search result for that keyword and extracts character strings from it, and then combines the extracted character strings to generate the second character string set. When extracting character strings from a search result, the search module 302 may use only the top several items of the search result.
In this embodiment, the elements in the second character string set may include character strings in the first character string set and entities in the entity set, and may further include character strings other than the elements in the first character string set and the entity set, such character strings are called unregistered character strings, and these unregistered character strings include character strings in which synonyms exist in the entity set, that is, the second character string set includes recognized character strings.
To mine the initial synonym pairs from the first character string set, the entity set, and the second character string set, in this embodiment the first identification module 303 mines synonym pairs from the search results in a pattern-matching (Pattern-Based) manner. For example, with patterns such as "A, also called B" and "A, alias B", A and B can form a synonym pair. For details, reference may be made to the above method embodiments, which are not repeated here.
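Pattern-based mining of synonym pairs can be illustrated with regular expressions. The sketch below encodes only the two example patterns quoted above; the regular expressions themselves, the function name `mine_synonym_pairs`, and the sample snippets are assumptions, and a practical system would likely need many more patterns and better boundary handling.

```python
import re

# Two illustrative patterns: "A, also called B" and "A, alias B".
PATTERNS = [
    re.compile(r"(?P<a>[\w ]+?), also called (?P<b>[\w ]+)"),
    re.compile(r"(?P<a>[\w ]+?), alias (?P<b>[\w ]+)"),
]


def mine_synonym_pairs(snippets):
    """Return the set of (A, B) pairs matched by any pattern in the snippets."""
    pairs = set()
    for text in snippets:
        for pattern in PATTERNS:
            for match in pattern.finditer(text):
                pairs.add((match.group("a").strip(), match.group("b").strip()))
    return pairs


snippets = [
    "programmer, also called coder.",
    "UI designer, alias interface designer",
]
pairs = mine_synonym_pairs(snippets)
```

Each mined pair can then be checked against the first character string set and the entity set to decide which character strings can be linked to which entities.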
In this embodiment, all the elements in the third character string set are character strings that have completed synonym recognition, that is, all the elements therein can be matched with corresponding synonyms in the entity set.
In some embodiments, when performing the supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set, and the entity set, the labeling module 304 is specifically configured to: calculate the similarity between the unrecognized character strings in the first character string set and the elements in the third character string set and the entity set, and link the corresponding character strings in the first character string set to the entities in the entity set when the similarity is greater than a first preset threshold, thereby completing the supplementary recognition. For details, reference may be made to the above method embodiments, which are not repeated here.
When linking the corresponding character string in the first character string set to the entity in the entity set when the similarity is greater than the first preset threshold, the labeling module 304 is specifically configured to: judge whether a plurality of similarities are greater than the first preset threshold, and if so, select the maximum similarity and link the corresponding character string in the first character string set to the entity in the entity set based on the maximum similarity. This yields higher recognition accuracy.
In some embodiments, when extracting character strings to be labeled from the first character string set after the supplementary recognition and linking the extracted character strings to the entities in the entity set based on a labeling result after labeling, the labeling module 304 is specifically configured to: select any unrecognized character string in the first character string set after the supplementary recognition as a current character string; calculate the similarity between the current character string and the other unrecognized character strings in the first character string set and the elements in the entity set; if the number of similarities exceeding a second preset threshold reaches a preset value, output the current character string to a target receiving end for labeling; and link the current character string to an entity in the entity set based on the labeling result; wherein the second preset threshold is smaller than the first preset threshold. For details, reference may be made to the above method embodiments, which are not repeated here.
In some embodiments, prior to calculating the similarity between the unrecognized character strings in the first character string set and the elements in the third character string set and the entity set, the labeling module 304 is further configured to: generate a plurality of character substrings based on all the character strings in the first character string set and the second character string set; obtain, based on the character substrings, the frequency of the common character substrings shared among the character strings; mark weights for the common character substrings whose frequency reaches a preset value; and regenerate all the character strings in the first character string set and the second character string set according to the marked weights to obtain a new first character string set and a new second character string set. For details, reference may be made to the above method embodiments, which are not repeated here.
In this embodiment, the control module 305 adds the recognized character strings to the recognized third character string set and causes the labeling module 304 to repeat the supplementary recognition; meanwhile, more character strings that are difficult to recognize are obtained for labeling. This process is repeated until the recall rate reaches the required value. For example, if 90% of the character strings in the first character string set can be matched to a corresponding entity in the entity set as a synonym, the recall rate is 90%; if the preset value of the recall rate is 90%, recognition stops and the recognition result is output once this value is reached.
In some embodiments, the control module 305 is further configured to update the predetermined knowledge base according to the recognition results of the character strings in the first character string set and the second character string set, so as to add the recognized unregistered character string to the predetermined knowledge base to generate a new entity. The recall rate and accuracy of subsequent synonym identification may be improved based on the updated knowledge base.
The synonym recognition device provided by the present application performs synonym recognition in an active-learning-based manner. By screening out the common character substrings that have a large influence and applying a small amount of labeling, it improves the accuracy of the text similarity calculation; by exploiting the transitivity of synonyms, it greatly reduces the labeling workload while maintaining high accuracy, and it reduces the influence of the differences between spoken and written text on synonym recognition.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment. The computer device 4 includes a memory 41, a processor 42, and a network interface 43, which are connected to each other through a system bus in a communication manner, where the memory 41 stores computer readable instructions, and the processor 42 implements the steps of the synonym identification method in the above method embodiment when executing the computer readable instructions, and has the beneficial effects corresponding to the above synonym identification method, which are not expanded herein.
It is noted that only a computer device 4 having the memory 41, the processor 42, and the network interface 43 is shown, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
In the present embodiment, the memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system and various types of application software installed on the computer device 4, such as computer readable instructions corresponding to the above synonym recognition method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, for example, execute computer readable instructions corresponding to the synonym identification method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions that are executable by at least one processor to cause the at least one processor to perform the steps of the synonym recognition method described above, with the beneficial effects corresponding to that method, which are not repeated here.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are only some, and not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. The application can be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A synonym recognition method, comprising the steps of:
carrying out named entity recognition on a text to be subjected to synonym recognition to obtain a first character string set, and reading a plurality of entities in a preset knowledge base to form an entity set;
taking the elements in the first character string set and the entity set as key words, sequentially searching in at least one given data search engine, and generating a second character string set according to a search result, wherein the second character string set comprises recognized character strings;
generating a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and linking at least one character string in the first character string set to an entity in the entity set according to the synonym pairs to complete synonym identification of at least one character string;
generating a third character string set according to the recognized character strings in the first character string set and the second character string set, performing supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set, extracting character strings to be labeled from the first character string set after the supplementary recognition, and linking the extracted character strings to the entities in the entity set based on a labeling result after the labeling;
and updating the third character string set according to the character strings subjected to supplementary recognition and marking, and repeating the steps of supplementary recognition and marking on the character strings which are not recognized in the first character string set after marking until the recognition recall rate of the character strings in the first character string set meets the preset condition.
2. The method according to claim 1, wherein the step of performing supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set comprises:
calculating similarity between the unrecognized character strings in the first character string set and elements in the third character string set and the entity set, and linking the corresponding character strings in the first character string set to the entities in the entity set when the similarity is greater than a first preset threshold value, so as to complete supplementary recognition;
the step of extracting the character string to be labeled from the first character string set after the supplementary recognition, and linking the extracted character string to the entity in the entity set based on the labeling result after the labeling comprises:
selecting any unidentified character string in the first character string set after supplementary identification as a current character string, calculating the similarity between the current character string and other unidentified character strings in the first character string set and elements in the entity set, if the number of the similarity exceeding a second preset threshold reaches a preset value, outputting the current character string to a target receiving end for labeling, and linking the current character string to the entities in the entity set based on a labeling result; wherein the second preset threshold is smaller than the first preset threshold.
3. The synonym recognition method of claim 2, wherein prior to the step of calculating the similarity of the unrecognized character strings in the first set of character strings to the elements in the third set of character strings and the set of entities, the method further comprises:
generating a plurality of character substrings based on all the character strings in the first character string set and the second character string set, acquiring the frequency of common character substrings among all the character strings based on the character substrings, carrying out weight marking on the common character substrings with the frequency reaching a preset value, and regenerating all the character strings in the first character string set and the second character string set according to the marked weights to obtain a new first character string set and a new second character string set.
4. The method according to claim 2 or 3, wherein the step of linking the corresponding strings in the first string set to the entities in the entity set when the similarity is greater than a first preset threshold comprises:
and judging whether a plurality of similarity degrees are larger than the first preset threshold value, if so, selecting the maximum similarity degree, and linking the corresponding character strings in the first character string set to the entity in the entity set based on the maximum similarity degree.
5. The synonym recognition method according to claim 2 or 3, wherein after the step until the recognition recall rate of the character strings in the first character string set reaches a preset value, the method further comprises:
and updating the preset knowledge base according to the recognition results of the character strings in the first character string set and the second character string set so as to add the recognized unregistered character strings into the preset knowledge base to generate a new entity.
6. The method according to claim 2 or 3, wherein the step of reading a plurality of entities in a predetermined knowledge base to form an entity set comprises:
and judging whether the character strings in the first character string set can be directly identified in the preset knowledge base or not, if so, linking the directly identified character strings to corresponding entities in the preset knowledge base, acquiring entities directly associated with the linked entities from the preset knowledge base, and generating the entity set according to the linked entities and the directly associated entities.
7. The method according to claim 6, wherein the step of reading a plurality of entities in a predetermined knowledge base to form an entity set further comprises:
acquiring a preset association series, acquiring other entities which are associated with the directly associated entities except the linked entities from the preset knowledge base according to the association series, taking the other entities as indirectly associated entities of the linked entities, and adding the indirectly associated entities to the entity set.
8. A synonym recognition device, characterized by comprising:
a data acquisition module, configured to perform named entity recognition on a text to be subjected to synonym recognition to obtain a first character string set, and to read a plurality of entities in a preset knowledge base to form an entity set;
a search module, configured to search sequentially in at least one given data search engine using the elements in the first character string set and the entity set as keywords, and to generate a second character string set according to the search results, wherein the second character string set includes recognized character strings;
a first recognition module, configured to generate a plurality of synonym pairs based on the first character string set, the entity set and the second character string set, and to link at least one character string in the first character string set to an entity in the entity set according to the synonym pairs, so as to complete synonym recognition of the at least one character string;
a labeling module, configured to generate a third character string set according to the recognized character strings in the first character string set and the second character string set, to perform supplementary recognition on the character strings in the first character string set based on the first character string set, the third character string set and the entity set, to extract character strings to be labeled from the first character string set after the supplementary recognition, and, after labeling, to link the extracted character strings to entities in the entity set based on the labeling result;
and a control module, configured to update the third character string set according to the character strings that have undergone supplementary recognition and labeling, and then to cause the labeling module to repeat the supplementary recognition and labeling on the character strings in the first character string set that remain unrecognized after labeling, until the recognition recall rate of the character strings in the first character string set meets a preset condition.
9. A computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions and the processor, when executing the computer-readable instructions, implements the steps of the synonym recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that computer-readable instructions are stored thereon which, when executed by a processor, implement the steps of the synonym recognition method according to any one of claims 1 to 7.
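The repeat-until-recall behaviour that claim 8 assigns to the control module can be sketched as a simple loop. This is a hedged illustration, not the claimed implementation. It assumes the third character string set is held as a dictionary from recognized strings to entities, that recall is the fraction of strings in the first character string set that have been linked, and that supplementary recognition and labeling are supplied as callables. The names `iterate_recognition`, `recall`, `supplement`, `annotate`, `recall_target`, `batch_size`, and `max_rounds` are all hypothetical.

```python
def recall(first_set, known):
    """Fraction of strings in first_set that are linked to an entity."""
    if not first_set:
        return 1.0
    return sum(1 for s in first_set if s in known) / len(first_set)


def iterate_recognition(first_set, known, supplement, annotate,
                        recall_target=0.9, batch_size=2, max_rounds=10):
    """Repeat supplementary recognition and labeling until recall is met.

    known: dict of recognized string -> entity, standing in for the third
           character string set (assumption).
    supplement(unrecognized, known): returns newly inferred links as a dict.
    annotate(batch): returns manually labeled links as a dict.
    max_rounds: safety bound added for this sketch, not taken from the claims.
    """
    known = dict(known)  # work on a copy so the caller's data is untouched
    for _ in range(max_rounds):
        if recall(first_set, known) >= recall_target:
            break
        # Supplementary recognition on strings that are still unlinked.
        unrecognized = [s for s in first_set if s not in known]
        known.update(supplement(unrecognized, known))
        # Extract a batch of the remaining strings for labeling.
        remaining = [s for s in first_set if s not in known]
        known.update(annotate(remaining[:batch_size]))
    return known
```

Because every labeled or supplemented string is merged back into `known`, each round has more recognized strings to draw on than the previous one, which mirrors the updating of the third character string set described in the claim.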
CN202110479989.9A 2021-04-30 2021-04-30 Synonym recognition method, synonym recognition device, computer equipment and storage medium Active CN113051900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110479989.9A CN113051900B (en) 2021-04-30 2021-04-30 Synonym recognition method, synonym recognition device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110479989.9A CN113051900B (en) 2021-04-30 2021-04-30 Synonym recognition method, synonym recognition device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113051900A (en) 2021-06-29
CN113051900B (en) 2023-08-22

Family

ID=76517864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110479989.9A Active CN113051900B (en) 2021-04-30 2021-04-30 Synonym recognition method, synonym recognition device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113051900B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202382A (en) * 2016-07-08 2016-12-07 南京缘长信息科技有限公司 Link instance method and system
CN108491373A (en) * 2018-02-01 2018-09-04 北京百度网讯科技有限公司 A kind of entity recognition method and system
CN110633464A (en) * 2018-06-22 2019-12-31 北京京东尚科信息技术有限公司 Semantic recognition method, device, medium and electronic equipment
CN110825827A (en) * 2019-11-13 2020-02-21 北京明略软件系统有限公司 Entity relationship recognition model training method and device and entity relationship recognition method and device
CN110969005A (en) * 2018-09-29 2020-04-07 航天信息股份有限公司 Method and device for determining similarity between entity corpora

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720520A (en) * 2023-08-07 2023-09-08 烟台云朵软件有限公司 Text data-oriented alias entity rapid identification method and system
CN116720520B (en) * 2023-08-07 2023-11-03 烟台云朵软件有限公司 Text data-oriented alias entity rapid identification method and system

Also Published As

Publication number Publication date
CN113051900B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN107679039B (en) Method and device for determining statement intention
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
CN113407785B (en) Data processing method and system based on distributed storage system
EP3579119A1 (en) Method and apparatus for recognizing event information in text
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN113987125A (en) Text structured information extraction method based on neural network and related equipment thereof
CN112287069A (en) Information retrieval method and device based on voice semantics and computer equipment
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN114357117A (en) Transaction information query method and device, computer equipment and storage medium
CN114398477A (en) Policy recommendation method based on knowledge graph and related equipment thereof
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
CN113051900B (en) Synonym recognition method, synonym recognition device, computer equipment and storage medium
CN114090792A (en) Document relation extraction method based on comparison learning and related equipment thereof
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN112949320A (en) Sequence labeling method, device, equipment and medium based on conditional random field
CN111639164A (en) Question-answer matching method and device of question-answer system, computer equipment and storage medium
CN112528040A (en) Knowledge graph-based method for guiding textbook corpus detection and related equipment thereof
CN112182157A (en) Training method of online sequence labeling model, online labeling method and related equipment
CN114970553B (en) Information analysis method and device based on large-scale unmarked corpus and electronic equipment
CN115730603A (en) Information extraction method, device, equipment and storage medium based on artificial intelligence
CN114742058A (en) Named entity extraction method and device, computer equipment and storage medium
CN114565316A (en) Task issuing method based on micro-service architecture and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant