CN115080695A - Chinese similar character retrieval method and device based on knowledge graph and electronic equipment

Chinese similar character retrieval method and device based on knowledge graph and electronic equipment

Info

Publication number
CN115080695A
Authority
CN
China
Prior art keywords
chinese
characters
similar
character
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210752941.5A
Other languages
Chinese (zh)
Inventor
贾伟
倪江柳伊
许春媛
董传磊
屈迪
张安洁
陈梓健
汪利飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rajax Network Technology Co Ltd
Original Assignee
Rajax Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rajax Network Technology Co Ltd
Priority to CN202210752941.5A
Publication of CN115080695A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application provides a method and a device for retrieving similar Chinese characters based on a knowledge graph, an electronic device, and a storage medium, and relates to the field of Internet technology. The method comprises: obtaining a character to be retrieved; searching a pre-constructed knowledge graph for the Chinese similar-character data pairs that contain a Chinese character matching the character to be retrieved, wherein the knowledge graph comprises data pairs each representing the similarity relation between two Chinese characters; and obtaining, from the data pairs found, the similar Chinese characters corresponding to the character to be retrieved. In the embodiments of the application, a knowledge graph of Chinese characters is constructed that comprises structured data pairs representing the similarity relation between two Chinese characters, and similar characters are retrieved through the knowledge graph, which strengthens the security prevention and control capability for Chinese content.

Description

Chinese similar character retrieval method and device based on knowledge graph and electronic equipment
Technical Field
The present application relates to the field of Internet technologies, and in particular to a method and an apparatus for retrieving similar Chinese characters based on a knowledge graph, an electronic device, and a storage medium.
Background
As the volume of Chinese-language content on the Internet grows, users pay increasing attention to Internet platforms. Chinese content takes diverse forms, so the same content can be expressed in many ways, and traditional prevention and control measures are increasingly stretched thin when Internet platforms deal with black-market and gray-market operations and the like. How to retrieve similar characters efficiently and accurately, so as to improve and expand Chinese content and strengthen its security prevention and control capability, has therefore become a technical problem in urgent need of a solution.
Disclosure of Invention
In view of the above problems, the present application provides a method and an apparatus for retrieving similar Chinese characters based on a knowledge graph, an electronic device, and a storage medium that overcome or at least partially solve the above problems. The technical solution is as follows:
In a first aspect, a method for retrieving similar Chinese characters based on a knowledge graph is provided, which includes:
acquiring a character to be retrieved, and searching a pre-constructed knowledge graph for the Chinese similar-character data pairs containing a Chinese character that matches the character to be retrieved, wherein the knowledge graph comprises data pairs each representing the similarity relation between two Chinese characters;
and acquiring, by using the data pairs found, the similar Chinese characters corresponding to the character to be retrieved.
In one possible implementation, the knowledge-graph is constructed by:
acquiring a plurality of Chinese characters, and constructing a Chinese character characteristic index for each of the plurality of Chinese characters;
determining similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character;
generating data pairs, each representing the similarity relation between two Chinese characters, according to the similar characters determined;
and constructing the knowledge graph by using the data pairs as knowledge items.
In one possible implementation, the Chinese character characteristic index includes one or more of: a pinyin index, a structure index, a character-splitting index, a four-corner code index, a five-stroke (Wubi) index, a stroke-order index, and a semantic index.
In one possible implementation, determining similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character includes:
based on the pinyin index of each Chinese character, treating front nasal and back nasal finals as equivalent and flat-tongue and curled-tongue (retroflex) initials as equivalent, and then determining Chinese characters with the same pinyin to be similar-sounding characters.
In one possible implementation, determining similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character includes:
determining a radical (side component) and a remainder of each Chinese character based on the structure index and the character-splitting index of each Chinese character, and determining Chinese characters with the same remainder to be similar characters.
In one possible implementation, determining similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character includes:
determining, based on the stroke-order index of each Chinese character, the parts that have the same position and the same stroke order as a common string;
and determining to be similar-shaped characters those Chinese characters for which the length of the longest contiguous common string, as a proportion of the total length of the stroke sequence, is greater than or equal to a first preset proportion threshold.
In one possible implementation, determining similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character includes:
based on the four-corner code index or the five-stroke index of each Chinese character, determining to be similar-shaped characters those Chinese characters for which the proportion of matching code is greater than or equal to a second preset proportion threshold.
In one possible implementation, determining similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character includes:
inputting the Chinese character characteristic index of each Chinese character, as features, into a pre-constructed model for predicting relations between Chinese character nodes;
and predicting the relations among the Chinese characters by using the model, so as to determine the similar-shaped characters among the plurality of Chinese characters.
In one possible implementation, acquiring the similar Chinese characters corresponding to the character to be retrieved by using the data pairs found includes:
extracting the characters and the similarity relations from the Chinese similar-character data pairs found;
and acquiring the similar Chinese characters corresponding to the character to be retrieved according to the extracted characters and similarity relations.
In a second aspect, an apparatus for retrieving similar Chinese characters based on a knowledge graph is provided, which includes:
a first acquisition module, configured to acquire a character to be retrieved;
a searching module, configured to search a pre-constructed knowledge graph for the Chinese similar-character data pairs containing a Chinese character that matches the character to be retrieved, wherein the knowledge graph comprises data pairs each representing the similarity relation between two Chinese characters;
and a second acquisition module, configured to acquire, by using the data pairs found, the similar Chinese characters corresponding to the character to be retrieved.
In one possible implementation, the apparatus further includes:
a building module, configured to acquire a plurality of Chinese characters and construct a Chinese character characteristic index for each of the plurality of Chinese characters;
determine similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character;
generate data pairs, each representing the similarity relation between two Chinese characters, according to the similar characters determined;
and construct the knowledge graph by using the data pairs as knowledge items.
In one possible implementation, the Chinese character characteristic index includes one or more of: a pinyin index, a structure index, a character-splitting index, a four-corner code index, a five-stroke (Wubi) index, a stroke-order index, and a semantic index.
In one possible implementation, the building module is further configured to:
based on the pinyin index of each Chinese character, treat front nasal and back nasal finals as equivalent and flat-tongue and curled-tongue (retroflex) initials as equivalent, and then determine Chinese characters with the same pinyin to be similar-sounding characters.
In one possible implementation, the building module is further configured to:
determine a radical (side component) and a remainder of each Chinese character based on the structure index and the character-splitting index of each Chinese character, and determine Chinese characters with the same remainder to be similar characters.
In one possible implementation, the building module is further configured to:
determine, based on the stroke-order index of each Chinese character, the parts that have the same position and the same stroke order as a common string;
and determine to be similar-shaped characters those Chinese characters for which the length of the longest contiguous common string, as a proportion of the total length of the stroke sequence, is greater than or equal to a first preset proportion threshold.
In one possible implementation, the building module is further configured to:
based on the four-corner code index or the five-stroke index of each Chinese character, determine to be similar-shaped characters those Chinese characters for which the proportion of matching code is greater than or equal to a second preset proportion threshold.
In one possible implementation, the building module is further configured to:
input the Chinese character characteristic index of each Chinese character, as features, into a pre-constructed model for predicting relations between Chinese character nodes;
and predict the relations among the Chinese characters by using the model, so as to determine the similar-shaped characters among the plurality of Chinese characters.
In one possible implementation, the second acquisition module is further configured to:
extract the characters and the similarity relations from the Chinese similar-character data pairs found;
and acquire the similar Chinese characters corresponding to the character to be retrieved according to the extracted characters and similarity relations.
In a third aspect, an electronic device is provided, which includes a processor and a memory, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the knowledge-graph-based method for retrieving similar Chinese characters according to any one of the above.
In a fourth aspect, a storage medium is provided, which stores a computer program, wherein the computer program is configured to perform, when run, the knowledge-graph-based method for retrieving similar Chinese characters according to any one of the above.
By means of the above technical solution, the method and apparatus for retrieving similar Chinese characters based on a knowledge graph, the electronic device, and the storage medium provided by the embodiments of the application acquire a character to be retrieved, and search a pre-constructed knowledge graph for the Chinese similar-character data pairs containing a Chinese character that matches the character to be retrieved, wherein the knowledge graph comprises data pairs each representing the similarity relation between two Chinese characters; the similar Chinese characters corresponding to the character to be retrieved are then acquired by using the data pairs found. In the embodiments of the application, a knowledge graph of Chinese characters is constructed that comprises structured data pairs representing the similarity relation between two Chinese characters, and similar characters are retrieved through the knowledge graph, which strengthens the security prevention and control capability for Chinese content.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a flowchart of a method for retrieving similar Chinese characters based on a knowledge graph according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the Chinese character characteristic indexes, node triples, and relation triples in the underlying data storage of a knowledge graph according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a knowledge graph according to another embodiment of the present application;
FIG. 4 is a block diagram of an apparatus for retrieving similar Chinese characters based on a knowledge graph according to an embodiment of the present application;
FIG. 5 is a block diagram of an apparatus for retrieving similar Chinese characters based on a knowledge graph according to another embodiment of the present application;
FIG. 6 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the accompanying drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that such uses are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the term "include" and its variants are to be read as open-ended terms meaning "including, but not limited to".
Before describing embodiments of the present application in detail, the following technical terms are introduced.
Knowledge graph: also known as knowledge domain visualization or knowledge domain mapping, a family of graphs that show how knowledge develops and how it is structured. Visualization techniques are used to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interrelations among them. The knowledge graph combines the theories and methods of disciplines such as applied mathematics, graphics, information visualization, and information science with methods such as bibliometric citation analysis and co-occurrence analysis, and uses visual graphs to present the core structure, development history, frontier fields, and overall knowledge framework of a discipline, thereby achieving multidisciplinary integration.
Triple: the data structure used in the underlying data storage of the knowledge graph. Triples are divided by type into node triples and relation triples: a node triple stores attribute information of a node, and a relation triple represents a relation between nodes.
In the embodiments of the application, a node in a node triple or a relation triple may be a Chinese character. A node triple may store a Chinese character, an attribute of that character, and the value of the attribute. A relation triple may store two Chinese characters and the similarity relation between them.
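The two triple types can be pictured with a minimal Python sketch. The class names, field names, and the relation label `similar_sound` are assumptions made for illustration; the application does not prescribe a storage format.

```python
from typing import NamedTuple


class NodeTriple(NamedTuple):
    """(character, attribute, value): one attribute of a character node."""
    char: str
    attribute: str
    value: str


class RelationTriple(NamedTuple):
    """(character, relation, character): a similarity edge between two nodes."""
    head: str
    relation: str
    tail: str


# Illustrative node triples for one character; attribute names are hypothetical.
node_triples = [
    NodeTriple("坏", "pinyin", "huai4"),
    NodeTriple("坏", "five_stroke", "FGIY"),
]

# Illustrative relation triple; the relation label is hypothetical.
relation_triples = [RelationTriple("坏", "similar_sound", "怀")]
```

Keeping attributes and relations in separate triple lists mirrors the split between node triples and relation triples described above.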
Chinese language characteristics: Chinese pinyin, strokes, four-corner code, five-stroke (Wubi) code, radicals and remainders, structure, composition, and the like.
Chinese character characteristic index: an index constructed on the basis of Chinese language characteristics, which may comprise one or more of a pinyin index, a structure index, a character-splitting index, a four-corner code index, a five-stroke index, a stroke-order index, and a semantic index.
Relation prediction: predicting the relation between nodes in the knowledge graph, that is, the relation between one Chinese character and another. Common methods include GCN (Graph Convolutional Network), CNN (Convolutional Neural Network), and the like.
In order to solve the above technical problem, an embodiment of the present application provides a method for retrieving similar Chinese characters based on a knowledge graph. As shown in FIG. 1, the method may include the following steps S101 and S102:
Step S101, acquiring a character to be retrieved, and searching a pre-constructed knowledge graph for the Chinese similar-character data pairs containing a Chinese character that matches the character to be retrieved, wherein the knowledge graph comprises data pairs each representing the similarity relation between two Chinese characters.
In this step, the character to be retrieved can be obtained from text, pictures, audio, video, and the like. Taking text as an example, one or more characters to be retrieved can be obtained from the text, and for each of them the pre-constructed knowledge graph is searched for the Chinese similar-character data pairs containing a matching Chinese character.
The knowledge graph can contain node triples and relation triples. A node triple can store a Chinese character, an attribute of the character, and the attribute's value; a relation triple can store two Chinese characters and the similarity relation between them. The relation triples are the data pairs representing the similarity relation between two Chinese characters.
Step S102, acquiring, by using the data pairs found, the similar Chinese characters corresponding to the character to be retrieved.
In the embodiments of the application, a character to be retrieved is acquired, and a pre-constructed knowledge graph is searched for the Chinese similar-character data pairs containing a Chinese character that matches it, wherein the knowledge graph comprises data pairs each representing the similarity relation between two Chinese characters; the similar Chinese characters corresponding to the character to be retrieved are then acquired by using the data pairs found. By constructing a knowledge graph of Chinese characters that comprises structured data pairs representing the similarity relation between two Chinese characters, and by retrieving similar characters through the knowledge graph, the embodiments improve and expand Chinese content and strengthen its security prevention and control capability.
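Steps S101 and S102 can be sketched as a lookup over relation triples. The following is a minimal sketch that assumes the data pairs are held in memory as `(head, relation, tail)` tuples and that similarity is symmetric; the relation labels and sample pairs are hypothetical, and a production system would more likely query a graph database.

```python
from collections import defaultdict

# Hypothetical relation triples (head, relation, tail); labels are illustrative.
RELATION_TRIPLES = [
    ("坏", "similar_sound", "怀"),
    ("坏", "similar_shape", "坯"),
    ("环", "similar_shape", "坏"),
]


def build_adjacency(triples):
    """Index each data pair under both characters (assumes symmetric similarity)."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
        adjacency[tail].append((relation, head))
    return adjacency


def retrieve_similar(char, adjacency):
    """S101: find the data pairs containing `char`; S102: return the partner characters."""
    return list(adjacency.get(char, []))


adjacency = build_adjacency(RELATION_TRIPLES)
print(retrieve_similar("坏", adjacency))
```

Indexing every pair under both characters means a query succeeds regardless of which side of the pair the character was stored on.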
In one possible implementation provided by the embodiments of the present application, the knowledge graph may be constructed through the following steps A1 to A4:
Step A1, acquiring a plurality of Chinese characters, and constructing a Chinese character characteristic index for each of the plurality of Chinese characters.
In this step, the Chinese character characteristic index can be constructed according to Chinese language characteristics, which specifically include Chinese pinyin, strokes, four-corner code, five-stroke (Wubi) code, radicals and remainders, structure, composition, and the like. The Chinese character characteristic index constructed here may include a pinyin index, a structure index, a character-splitting index, a four-corner code index, a five-stroke index, a stroke-order index, a semantic index, and the like, such as the Chinese character characteristic indexes in the underlying data storage of the knowledge graph shown in FIG. 2.
For example, for the character 坏 ("bad"):
Pinyin index: 坏 -> huai4
Structure index: 坏 -> left-right structure
Character-splitting index: 坏 -> 土, 不
Four-corner code index: 坏 -> 81790
Five-stroke index: 坏 -> FGIY
Stroke-order index: 坏 -> horizontal, vertical, rising, horizontal, left-falling, vertical, dot
It should be noted that the above examples are merely illustrative and do not limit the embodiments of the present application.
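One way to hold the indexes of a single character together is a small record type. The sketch below is illustrative only: the class name and field names are assumptions, and the values repeat the example listed above for 坏.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CharIndex:
    """Characteristic indexes of one character (field names are hypothetical)."""
    char: str
    pinyin: str                  # syllable plus tone digit
    structure: str               # e.g. "left-right"
    components: Tuple[str, ...]  # result of character splitting
    four_corner: str
    five_stroke: str
    strokes: Tuple[str, ...]     # stroke names in writing order


# Values repeat the example given in this description.
HUAI = CharIndex(
    char="坏",
    pinyin="huai4",
    structure="left-right",
    components=("土", "不"),
    four_corner="81790",
    five_stroke="FGIY",
    strokes=("horizontal", "vertical", "rising", "horizontal",
             "left-falling", "vertical", "dot"),
)
```

A frozen dataclass keeps each record immutable, which suits an index that is built once and then only read.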
Step A2, determining similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character.
Step A3, generating data pairs, each representing the similarity relation between two Chinese characters, according to the similar characters determined.
As shown in FIG. 2, the knowledge graph may include node triples and relation triples. A node triple may store a Chinese character, an attribute of the character, and the attribute's value; a relation triple may store two Chinese characters and the similarity relation between them. The relation triples are the data pairs representing the similarity relation between two Chinese characters. It should be noted that the example in FIG. 2 is only illustrative and does not limit the embodiments of the present application.
Step A4, constructing the knowledge graph by using the data pairs representing the similarity relation between two Chinese characters as knowledge items.
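Steps A1 to A4 can be read as a small pipeline: index every character, test every pair against each rule, and collect the matching pairs as knowledge items. The sketch below assumes a toy pinyin-only index and a single toy rule; all names are hypothetical, and the exhaustive pairwise loop is chosen for clarity rather than efficiency.

```python
from itertools import combinations

# Step A1 (toy data): a pinyin-only characteristic index per character.
INDEX = {
    "坏": {"pinyin": "huai4"},
    "怀": {"pinyin": "huai2"},
    "坯": {"pinyin": "pi1"},
}


def same_syllable(a, b):
    """Toy rule: same pinyin once the trailing tone digit is dropped."""
    return a["pinyin"].rstrip("012345") == b["pinyin"].rstrip("012345")


# Relation label -> rule; the label is a hypothetical name.
RULES = {"similar_sound": same_syllable}


def build_graph(index, rules):
    """Steps A2 to A4: test every pair and emit one data pair per matching rule."""
    pairs = []
    for c1, c2 in combinations(sorted(index), 2):
        for relation, rule in rules.items():
            if rule(index[c1], index[c2]):
                pairs.append((c1, relation, c2))
    return pairs


graph = build_graph(INDEX, RULES)
```

Additional rules can be registered in `RULES` without changing `build_graph`.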
Based on the Chinese character characteristic indexes, node triples, and relation triples in the underlying data storage described above, the constructed knowledge graph may be as shown in FIG. 3. In each of the following pairs, both characters display their own node-triple information, and the two are linked by a similarity relation: "bad" (坏) and "blank" (坯) are similar characters; "bad" (坏) and "plutonium" (钚) are similar characters; "bad" (坏) and "nostalgic" (怀) are similar characters in both sound and shape; "bad" and "sum" are similar characters; "bad" (坏) and "ring" (环) are similar characters; and "bad" and "projection" are similar characters.
In FIG. 3, some pairs of nodes in the knowledge graph are joined by a question mark, for example "projection" and "embryo" (坯); whether such a pair has a similarity relation can be determined from the Chinese character characteristic indexes of the two characters. The same applies to the question mark attached to "ring" (环). It should be noted that the example in FIG. 3 is only illustrative and does not limit the embodiments of the present application.
With the knowledge graph of FIG. 3, the relations between Chinese characters can readily be determined manually. With limited manpower, however, it is difficult to enumerate the relations among all Chinese characters, so inference is needed. The embodiments of the application provide a rule-based inference scheme and a model-based inference scheme. The rule-based scheme treats, for example, Chinese characters with the same pinyin as similar in sound and Chinese characters with the same remainder as similar in shape. The model-based scheme converts the Chinese character characteristic indexes into features and then performs algorithmic inference; common approaches include GCN, CNN, and the like. Both schemes are described in detail below.
In one possible implementation provided by the embodiments of the present application, step A2 determines similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character as follows: based on the pinyin index of each Chinese character, front nasal and back nasal finals are treated as equivalent, as are flat-tongue and curled-tongue (retroflex) initials, and Chinese characters whose pinyin is then the same are determined to be similar-sounding characters. For example, "bad" (坏, huai4) and "nostalgic" (怀, huai2) are similar-sounding characters, as are "blank" (坯, pi1) and the character glossed "Brassica" (pi1).
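The pinyin rule can be sketched as a normalization function followed by an equality test. This is a minimal sketch under stated assumptions: pinyin is written as a syllable plus a tone digit, the tone is ignored (consistent with the huai4/huai2 example), and the merging is done with three regular-expression substitutions. The function names are hypothetical.

```python
import re


def normalize_pinyin(pinyin: str) -> str:
    """Collapse the distinctions that the rule treats as equivalent."""
    syllable = re.sub(r"\d$", "", pinyin.lower())      # drop the tone digit
    syllable = re.sub(r"^(z|c|s)h", r"\1", syllable)   # zh->z, ch->c, sh->s
    syllable = re.sub(r"ng$", "n", syllable)           # ang->an, eng->en, ing->in
    return syllable


def similar_sound(pinyin_a: str, pinyin_b: str) -> bool:
    """True when two readings coincide after normalization."""
    return normalize_pinyin(pinyin_a) == normalize_pinyin(pinyin_b)


print(similar_sound("huai4", "huai2"))
print(similar_sound("zhang1", "zan4"))
```

The final substitution also rewrites "ong" to "on"; since "on" is not a pinyin final, this should not merge otherwise distinct syllables.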
In one possible implementation provided by this embodiment, step A2 determines similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character as follows: a radical (side component) and a remainder of each Chinese character are determined based on the structure index and the character-splitting index of each Chinese character, and Chinese characters with the same remainder are determined to be similar characters. For example, "bad" (坏) and the character glossed "carry" both have the remainder 不 ("not"), so the two are similar characters; likewise, "bad" and the character glossed "sum" both have the remainder 不 and are similar characters.
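The remainder rule amounts to grouping characters by their remainder and pairing up the members of each group. The sketch below assumes the structure and character-splitting indexes have already been reduced to a `(side component, remainder)` tuple per character; the sample decompositions and all names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical split index: character -> (side component, remainder).
SPLIT_INDEX = {
    "坏": ("土", "不"),
    "怀": ("忄", "不"),
    "环": ("王", "不"),
    "坯": ("土", "丕"),
}


def similar_by_remainder(split_index):
    """Pair up all characters that share the same remainder."""
    groups = defaultdict(list)
    for char, (_side, remainder) in split_index.items():
        groups[remainder].append(char)
    pairs = []
    for chars in groups.values():
        pairs.extend(combinations(chars, 2))
    return pairs


remainder_pairs = similar_by_remainder(SPLIT_INDEX)
```

Grouping first keeps the work proportional to the size of each remainder group instead of comparing every character with every other.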
In one possible implementation provided by the embodiments of the present application, step A2 determines similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character as follows: based on the stroke-order index of each Chinese character, the parts that have the same position and the same stroke order are determined as a common string; and those Chinese characters for which the length of the longest contiguous common string, as a proportion of the total length of the stroke sequence, is greater than or equal to a first preset proportion threshold are determined to be similar-shaped characters. For example, the pair glossed "children" and "writings", the pair glossed "wins" and "wins", and "bad" (坏) and "blank" (坯) are similar-shaped characters.
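The stroke-order rule can be sketched as a position-by-position comparison. The following assumes each stroke sequence is encoded as a string with one symbol per stroke (here the common five-category digits: 1 horizontal, 2 vertical, 3 left-falling, 4 dot or right-falling, 5 turning), that the proportion is taken against the longer of the two sequences, and that the threshold is 0.6. The text fixes none of these choices, so all three are assumptions.

```python
def longest_aligned_run(a: str, b: str) -> int:
    """Length of the longest contiguous run where both sequences have the
    same stroke at the same position."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def similar_by_strokes(a: str, b: str, threshold: float = 0.6) -> bool:
    """Compare the run length with the longer sequence (an assumption)."""
    total = max(len(a), len(b))
    return total > 0 and longest_aligned_run(a, b) / total >= threshold


# 千 (left-falling, horizontal, vertical) vs 干 (horizontal, horizontal, vertical)
print(similar_by_strokes("312", "112"))
```

Because the comparison is aligned by position, an extra stroke at the start of one character shifts every later stroke and breaks the match; an unaligned longest-common-substring search would be a different, more permissive rule.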
In one possible implementation provided by this embodiment, step A2 determines similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character as follows: based on the four-corner code index or the five-stroke index of each Chinese character, those Chinese characters for which the proportion of matching code is greater than or equal to a second preset proportion threshold are determined to be similar-shaped characters. For example, "wind" (风) and "phoenix" (凤) are similar-shaped characters.
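The code-overlap rule can be sketched the same way for four-corner or five-stroke codes. This is a minimal sketch that assumes the "same part" of two codes means position-wise matching symbols, measured against the longer code, with a threshold of 0.75; the second code in the usage line is a made-up placeholder, not a real five-stroke code.

```python
def code_overlap(code_a: str, code_b: str) -> float:
    """Fraction of positions at which the two codes carry the same symbol."""
    if not code_a or not code_b:
        return 0.0
    matches = sum(x == y for x, y in zip(code_a, code_b))
    return matches / max(len(code_a), len(code_b))


def similar_by_code(code_a: str, code_b: str, threshold: float = 0.75) -> bool:
    """Apply the (assumed) second preset proportion threshold."""
    return code_overlap(code_a, code_b) >= threshold


# "FGIY" is the five-stroke code listed for 坏; "FGIU" is a made-up placeholder.
print(code_overlap("FGIY", "FGIU"))
```

The same function works for both code systems because it only compares symbols at matching positions.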
This embodiment provides a possible implementation in which step A2, determining similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character, uses a model-based inference scheme that may specifically include the following steps B1 and B2:
Step B1: input the Chinese character characteristic indexes of the Chinese characters, as features, into a pre-constructed relation prediction model between Chinese character nodes.
Step B2: predict the relations between the Chinese characters using the relation prediction model between Chinese character nodes, and determine the similar-form characters among the plurality of Chinese characters.
In steps B1 and B2, the relation prediction model between Chinese character nodes may be pre-constructed using an algorithm such as a GCN (graph convolutional network) or a CNN (convolutional neural network). Using the rule-based schemes described above, preliminary relations are constructed from identical pinyin, identical remainders (for example, the characters glossed as "bad" and "wye"), and similar stroke sequences (for example, the characters glossed as "thousand" and "stem"), yielding a batch of similar-form and similar-sound character combinations as initial samples. The pre-constructed model is trained on these initial samples to obtain a trained relation prediction model between Chinese character nodes. The trained model then builds feature representations from the characteristics of the Chinese characters and predicts the relations between Chinese character nodes, thereby completing the graph network of the knowledge graph.
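The bootstrapping step described above, in which rule outputs become initial training samples, can be sketched as follows. This is only an illustration of sample assembly; it does not implement a GCN or CNN. The record fields (`pinyin`, `remainder`), the relation names, and the placeholder characters `A`, `B`, `C` are all assumptions.

```python
from itertools import combinations


def seed_samples(index, rules):
    """Label every character pair with the names of the rules that relate it.

    index: dict mapping a character to its characteristic-index record.
    rules: dict mapping a relation name to a predicate over two records.
    Returns (char_a, char_b, relation) triples usable as initial samples.
    """
    samples = []
    for a, b in combinations(sorted(index), 2):
        for relation, predicate in rules.items():
            if predicate(index[a], index[b]):
                samples.append((a, b, relation))
    return samples


# Hypothetical records; field names and values are placeholders.
INDEX = {
    "A": {"pinyin": "hua", "remainder": "r1"},
    "B": {"pinyin": "hua", "remainder": "r2"},
    "C": {"pinyin": "bei", "remainder": "r1"},
}
RULES = {
    "similar_sound": lambda x, y: x["pinyin"] == y["pinyin"],
    "similar_form": lambda x, y: x["remainder"] == y["remainder"],
}

samples = seed_samples(INDEX, RULES)
```

The resulting triples would serve as labeled edges for training; pairs that no rule relates are left unlabeled for the trained model to score.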
This embodiment draws together multiple linguistic features, such as pinyin, radicals and remainders, four-corner codes, five-stroke (Wubi) codes, and stroke sequences, and on that basis pioneers the use of a knowledge graph to store and retrieve similar characters, thereby enriching and extending Chinese content and strengthening the ability to control Chinese content security.
This embodiment provides a possible implementation in which step S102, obtaining the Chinese similar characters corresponding to the word to be retrieved by using the found Chinese similar character data pairs, proceeds as follows: the two Chinese characters and the similarity relation are extracted from each found Chinese similar character data pair; the Chinese similar characters corresponding to the word to be retrieved are then obtained from the extracted pairs of Chinese characters and their similarity relations.
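A minimal sketch of this lookup is given below, assuming each data pair is stored as a `(char_a, char_b, relation)` triple and that the query is a single character. The triple layout, the relation label, and the sample pairs are assumptions made for illustration.

```python
from collections import defaultdict


def build_lookup(pairs):
    """Index (char_a, char_b, relation) data pairs by each of their two characters."""
    lookup = defaultdict(list)
    for a, b, relation in pairs:
        lookup[a].append((b, relation))
        lookup[b].append((a, relation))
    return lookup


def retrieve_similar(char, lookup):
    """Return (similar_char, relation) tuples for the queried character."""
    return list(lookup.get(char, []))


# Illustrative data pairs, not taken from the application.
PAIRS = [("坏", "杯", "similar_form"), ("坏", "怀", "similar_form")]
lookup = build_lookup(PAIRS)
result = retrieve_similar("坏", lookup)
```

Indexing each pair under both of its characters makes the relation symmetric at query time, so a pair stored once can be found from either side.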
For example, when the word to be retrieved is the character romanized as "hua", the Chinese similar characters obtained through steps S101 and S102 may include the characters rendered in translation as: changed, hua, swoosh, bright-bright, birch, light, flower, corrupt, goods, ploughshare, boot, Wei, rail, and the like.
It should be noted that, in practical applications, all of the possible implementations described above may be combined in any manner to form further possible embodiments of the present application, which are not described one by one here.
Based on the same inventive concept as the knowledge-graph-based Chinese similar character retrieval method provided by the foregoing embodiments, an embodiment of the present application further provides a knowledge-graph-based Chinese similar character retrieval device.
Fig. 4 is a block diagram of a knowledge-graph-based Chinese similar character retrieval device according to an embodiment of the present application. As shown in Fig. 4, the device may specifically include a first obtaining module 410, a searching module 420, and a second obtaining module 430.
The first obtaining module 410 is configured to obtain a word to be retrieved.
The searching module 420 is configured to search a pre-constructed knowledge graph for the Chinese similar character data pairs that contain a Chinese character matching the word to be retrieved, where the knowledge graph includes data pairs representing the similarity relation between every two Chinese characters.
The second obtaining module 430 is configured to obtain, by using the found Chinese similar character data pairs, the Chinese similar characters corresponding to the word to be retrieved.
An embodiment of the present application provides a possible implementation in which, as shown in Fig. 5, the device shown in Fig. 4 may further include a constructing module 510, configured to:
obtain a plurality of Chinese characters, and construct a Chinese character characteristic index for each of the plurality of Chinese characters;
determine similar characters among the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character;
generate data pairs representing the similarity relation between every two Chinese characters according to the determined similar characters;
and construct the knowledge graph by using the data pairs representing the similarity relation between every two Chinese characters as knowledge items.
An embodiment of the present application provides a possible implementation in which the Chinese character characteristic index includes one or more of: a pinyin index, a structure index, a character splitting index, a four-corner code index, a five-stroke (Wubi) index, a stroke sequence index, and a semantic index.
An embodiment of the present application provides a possible implementation in which the constructing module 510 shown in Fig. 5 is further configured to:
based on the pinyin index of each Chinese character, treat front and back nasal finals as equivalent and flat-tongue and curled-tongue (retroflex) initials as equivalent, and determine Chinese characters with the same pinyin to be similar-sound characters.
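The pinyin unification can be sketched as a normalization applied before comparing syllables. The example assumes toneless pinyin input and assumes the specific mappings zh/ch/sh to z/c/s and -ang/-eng/-ing to -an/-en/-in; the text names the two kinds of merging but does not list the exact mappings.

```python
def normalize_pinyin(syllable):
    """Collapse retroflex-initial and nasal-final distinctions in a toneless syllable."""
    s = syllable.lower()
    # Curled-tongue (retroflex) initials zh/ch/sh map to flat-tongue z/c/s.
    for retroflex, flat in (("zh", "z"), ("ch", "c"), ("sh", "s")):
        if s.startswith(retroflex):
            s = flat + s[len(retroflex):]
            break
    # Back nasal finals -ang/-eng/-ing map to front nasal -an/-en/-in.
    for back in ("ang", "eng", "ing"):
        if s.endswith(back):
            s = s[:-1]  # drop the trailing "g"
            break
    return s


def same_sound(pinyin_a, pinyin_b):
    """True if two syllables are identical after normalization."""
    return normalize_pinyin(pinyin_a) == normalize_pinyin(pinyin_b)
```

Under these assumed mappings, `zhang` and `zan` both normalize to `zan`, so characters carrying those readings would be grouped as similar-sound characters.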
An embodiment of the present application provides a possible implementation in which the constructing module 510 shown in Fig. 5 is further configured to:
determine a radical and a remainder of each Chinese character based on the structure index and the character splitting index of each Chinese character, and determine Chinese characters that share the same remainder among the plurality of Chinese characters to be similar characters.
An embodiment of the present application provides a possible implementation in which the constructing module 510 shown in Fig. 5 is further configured to:
determine, based on the stroke sequence index of each Chinese character, the portions of the stroke sequences that occupy the same positions and contain the same strokes to be common strings;
and determine Chinese characters for which the length of the longest continuous common string, as a proportion of the total length of the stroke sequence, is greater than or equal to a first preset proportion threshold to be similar characters.
An embodiment of the present application provides a possible implementation in which the constructing module 510 shown in Fig. 5 is further configured to:
determine, based on the four-corner code index or the five-stroke (Wubi) index of each Chinese character, Chinese characters whose codes coincide in a proportion greater than or equal to a second preset proportion threshold to be similar characters.
An embodiment of the present application provides a possible implementation in which the constructing module 510 shown in Fig. 5 is further configured to:
input the Chinese character characteristic index of each Chinese character, as features, into a pre-constructed relation prediction model between Chinese character nodes;
and predict the relations between the Chinese characters using the relation prediction model between Chinese character nodes, so as to determine the similar-form characters among the plurality of Chinese characters.
An embodiment of the present application provides a possible implementation in which the second obtaining module 430 shown in Fig. 4 or Fig. 5 is further configured to:
extract the two Chinese characters and the similarity relation from each found Chinese similar character data pair;
and obtain the Chinese similar characters corresponding to the word to be retrieved according to the extracted pairs of Chinese characters and their similarity relations.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device, which includes a processor and a memory, where the memory stores a computer program and the processor is configured to run the computer program so as to perform the knowledge-graph-based Chinese similar character retrieval method according to any of the above embodiments.
An exemplary embodiment provides an electronic device. As shown in Fig. 6, the electronic device 600 includes a processor 601 and a memory 603. The processor 601 is coupled to the memory 603, for example via a bus 602. Optionally, the electronic device 600 may also include a transceiver 604. It should be noted that, in practical applications, the number of transceivers 604 is not limited to one, and the structure of the electronic device 600 does not constitute a limitation on the embodiments of the present application.
The processor 601 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the present application. The processor 601 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 602 may include a path that transfers information between the above components. The bus 602 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 602 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The memory 603 may be, but is not limited to, a ROM (Read-Only Memory) or another type of static storage device that can store static information and instructions; a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions; an EEPROM (Electrically Erasable Programmable Read-Only Memory); a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or other magnetic storage device; or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 603 stores the computer program code for carrying out the solution of the present application, and execution of that code is controlled by the processor 601. The processor 601 executes the computer program code stored in the memory 603 to implement what is described in the foregoing method embodiments.
The electronic devices include, but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals); and fixed terminals such as digital TVs and desktop computers. The electronic device shown in Fig. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
Based on the same inventive concept, the present application further provides a storage medium, in which a computer program is stored, where the computer program is configured to execute the method for retrieving chinese similar words based on a knowledge graph according to any one of the above embodiments when running.
It can be clearly understood by those skilled in the art that the specific working processes of the system, the apparatus, and the module described above may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, the details are not repeated herein.
Those of ordinary skill in the art will understand that the technical solution of the present application may be embodied, in essence or in whole or in part, in the form of a software product. The computer software product is stored in a storage medium and includes program instructions that, when executed, cause an electronic device (e.g., a personal computer, a server, or a network device) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Alternatively, all or part of the steps of the foregoing method embodiments may be carried out by hardware (an electronic device such as a personal computer, a server, or a network device) under the direction of program instructions. The program instructions may be stored in a computer-readable storage medium, and when they are executed by a processor of the electronic device, the electronic device performs all or part of the steps of the methods described in the embodiments of the present application.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present application; such modifications or substitutions do not depart from the scope of the present application.

Claims (10)

1. A knowledge-graph-based Chinese similar character retrieval method, characterized by comprising the following steps:
acquiring a word to be retrieved, and searching a pre-constructed knowledge graph for the Chinese similar character data pairs that contain a Chinese character matching the word to be retrieved, wherein the knowledge graph comprises data pairs representing the similarity relation between every two Chinese characters;
and acquiring the Chinese similar characters corresponding to the word to be retrieved by using the found Chinese similar character data pairs.
2. The method of claim 1, wherein the knowledge-graph is constructed by:
acquiring a plurality of Chinese characters, and constructing a Chinese character characteristic index aiming at each Chinese character in the plurality of Chinese characters;
determining similar characters in the plurality of Chinese characters according to the Chinese character characteristic indexes of the Chinese characters;
generating a data pair representing the similarity relation between every two Chinese characters according to the determined similar characters in the Chinese characters;
and constructing a knowledge graph by using the data pairs representing the similarity between every two Chinese characters as knowledge items.
3. The method of claim 2, wherein the Chinese character characteristic index comprises one or more of: a pinyin index, a structure index, a character splitting index, a four-corner code index, a five-stroke index, a stroke sequence index, and a semantic index.
4. The method of claim 3, wherein determining similar words in the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character comprises:
based on the pinyin index of each Chinese character, treating front and back nasal finals as equivalent and flat-tongue and curled-tongue initials as equivalent, and determining Chinese characters with the same pinyin to be similar-sound characters.
5. The method of claim 3, wherein determining similar words in the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character comprises:
determining a radical and a remainder of each Chinese character based on the structure index and the character splitting index of each Chinese character, and determining Chinese characters that share the same remainder among the plurality of Chinese characters to be similar characters.
6. The method of claim 3, wherein determining similar words in the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character comprises:
determining, based on the stroke sequence index of each Chinese character, the portions of the stroke sequences that occupy the same positions and contain the same strokes to be common strings;
and determining Chinese characters for which the length of the longest continuous common string, as a proportion of the total length of the stroke sequence, is greater than or equal to a first preset proportion threshold to be similar characters.
7. The method of claim 3, wherein determining similar words in the plurality of Chinese characters according to the Chinese character characteristic index of each Chinese character comprises:
determining, based on the four-corner code index or the five-stroke index of each Chinese character, Chinese characters whose codes coincide in a proportion greater than or equal to a second preset proportion threshold to be similar characters.
8. A knowledge-graph-based Chinese similar character retrieval device, characterized by comprising:
a first obtaining module, configured to obtain a word to be retrieved;
a searching module, configured to search a pre-constructed knowledge graph for the Chinese similar character data pairs that contain a Chinese character matching the word to be retrieved, wherein the knowledge graph comprises data pairs representing the similarity relation between every two Chinese characters;
and a second obtaining module, configured to obtain the Chinese similar characters corresponding to the word to be retrieved by using the found Chinese similar character data pairs.
9. An electronic device comprising a processor and a memory, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the method for retrieving Chinese similar words based on a knowledge-graph according to any one of claims 1 to 7.
10. A storage medium having a computer program stored therein, wherein the computer program is configured to execute the method for retrieving chinese similar words based on a knowledge-graph of any one of claims 1 to 7 when running.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210752941.5A CN115080695A (en) 2022-06-29 2022-06-29 Chinese similar character retrieval method and device based on knowledge graph and electronic equipment


Publications (1)

Publication Number Publication Date
CN115080695A (en) 2022-09-20

Family

ID=83255790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210752941.5A Pending CN115080695A (en) 2022-06-29 2022-06-29 Chinese similar character retrieval method and device based on knowledge graph and electronic equipment

Country Status (1)

Country Link
CN (1) CN115080695A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination