CN116595124B - Cross-language implicit associated knowledge discovery method, device, equipment and storage medium - Google Patents

Info

Publication number
CN116595124B
Authority
CN
China
Prior art keywords
keyword
language
occurrence
order
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310873311.8A
Other languages
Chinese (zh)
Other versions
CN116595124A (en)
Inventor
徐红姣
何彦青
刘志辉
王莉军
兰天
许德山
潘优
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Scientific And Technical Information Of China
Original Assignee
Institute Of Scientific And Technical Information Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Scientific And Technical Information Of China
Priority to CN202310873311.8A
Publication of CN116595124A
Application granted
Publication of CN116595124B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3337 Translation of the query language, e.g. Chinese to English
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application provide a cross-language implicit associated knowledge discovery method, device, equipment and storage medium, and relate to the technical field of knowledge discovery. The method comprises the following steps: extracting keywords from a first language technical literature data set and a second language technical literature data set respectively to obtain a first language keyword set and a second language keyword set; performing first language translation and word sense disambiguation on each second language keyword to obtain a target translation keyword set; obtaining, according to the co-occurrence relation between the keywords in the first language keyword set and the target translation keyword set and the language attribute of each keyword, at least one potential hidden associated word pair formed by keywords from different language technical literature data sets, together with the association degree of the words in each pair; and screening the potential hidden associated word pairs according to their association degrees to obtain target hidden associated word pairs. The method can effectively improve the reliability of the implicit associations acquired between knowledge.

Description

Cross-language implicit associated knowledge discovery method, device, equipment and storage medium
Technical Field
The application relates to the technical field of knowledge discovery, in particular to a cross-language implicit associated knowledge discovery method, device, equipment and storage medium.
Background
Literature-based knowledge discovery (LBD) is a classical informatics method that discovers knowledge transfer and implicit associations through literature mining. LBD usually takes publicly published scientific literature as its object of analysis and discovers potential "new" knowledge by mining knowledge concepts, and the implicit associations between them, using text mining, natural language processing, machine learning and other technologies.
Because languages differ in grammar, vocabulary, structure and culture, the meaning of knowledge and the relations within it are difficult to capture across languages. Ambiguity and misunderstanding arise, implicit associations between knowledge are mismatched or distorted, and the implicit associations acquired are therefore of low reliability. In addition, knowledge is unevenly distributed across languages: the quantity and quality of scientific literature differ from language to language. During cross-language knowledge discovery, this imbalance can cause the implicit association between certain pieces of knowledge to be overestimated or underestimated, which again lowers the reliability of the acquired associations.
At present, most literature-based knowledge discovery methods are single-language methods, and none of them can solve the above problems. It is therefore highly desirable to provide a cross-language implicit associated knowledge discovery scheme that improves the reliability of the implicit associations acquired between cross-language knowledge.
Disclosure of Invention
The application aims to solve at least one of the above technical defects. The technical scheme provided by the embodiments of the application is as follows:
in a first aspect, an embodiment of the present application provides a cross-language implicit association knowledge discovery method, including:
extracting keywords from the first language technical literature data set to obtain a first language keyword set, and extracting keywords from the second language technical literature data set to obtain a second language keyword set;
performing first language translation and word sense disambiguation on each second language keyword in the second language keyword set to obtain a target translation keyword set;
acquiring at least one potential hidden associated word pair, and the association degree of the words in each potential hidden associated word pair, according to the co-occurrence relation between the keywords in the first language keyword set and the target translation keyword set and the language attribute of each keyword; wherein the two keywords in each potential hidden associated word pair come only from the first language technical literature data set and only from the second language technical literature data set, respectively; the language attribute represents the language type of the technical literature data set from which a keyword is derived;
and screening the potential hidden associated word pairs according to their association degrees to obtain target hidden associated word pairs.
In an optional embodiment of the present application, performing first language translation and word sense disambiguation on each second language keyword in the second language keyword set to obtain a target translation keyword set specifically includes:
translating each second language keyword in the second language keyword set into the first language according to a bilingual scientific and technical dictionary for the first language and the second language, to obtain an initial translation keyword set; wherein each second language keyword corresponds to at least one translation keyword in the initial translation keyword set;
filtering out of the initial translation keyword set the translation keywords that are not in the first language keyword set, to obtain an intermediate translation keyword set;
and screening out, from the intermediate translation keyword set, the unique translation keyword corresponding to each second language keyword, to obtain the target translation keywords.
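The translation and filtering steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `translate_and_filter`, the dictionary layout (a mapping from a second language keyword to a list of candidate translations) and the toy entries are all assumptions made for the example.

```python
from typing import Dict, List, Set


def translate_and_filter(
    second_lang_keywords: List[str],
    bilingual_dict: Dict[str, List[str]],
    first_lang_keywords: Set[str],
) -> Dict[str, List[str]]:
    """Map each second language keyword to the candidate translations
    that also occur in the first language keyword set."""
    intermediate: Dict[str, List[str]] = {}
    for keyword in second_lang_keywords:
        # Initial translation set: every dictionary translation of the keyword.
        initial = bilingual_dict.get(keyword, [])
        # Intermediate set: drop translations absent from the first language keywords.
        kept = [t for t in initial if t in first_lang_keywords]
        if kept:
            intermediate[keyword] = kept
    return intermediate


# Hypothetical toy data: English as the second language, Chinese as the first.
toy_dict = {"cell": ["细胞", "电池", "单元"], "graph": ["图", "图表"]}
toy_first_lang_keywords = {"细胞", "电池", "图"}
candidates = translate_and_filter(["cell", "graph", "quark"], toy_dict, toy_first_lang_keywords)
print(candidates)
```

Keywords with several surviving candidates (here "cell") still need the disambiguation step to reach a unique translation; keywords with no surviving candidate are simply dropped in this sketch.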
In an alternative embodiment of the present application, screening out a unique translation keyword corresponding to each second language keyword in the intermediate translation keyword set specifically includes:
for each second language keyword in the second language keyword set, acquiring a context word vector corresponding to the second language keyword according to the text containing the second language keyword in the second language technical literature data set;
acquiring a word vector for each translation keyword corresponding to the second language keyword in the intermediate translation keyword set;
obtaining, according to the word vector of each translation keyword and the context word vector, the similarity between the word vector of each translation keyword and the context word vector corresponding to the second language keyword;
and selecting, according to the similarities, the translation keyword with the maximum similarity as the unique translation keyword.
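A minimal sketch of this disambiguation step is given below. It assumes that the context word vector and the translation word vectors already live in a shared vector space and that similarity is measured by cosine similarity; the source states neither how the vectors are produced nor which similarity measure is used, so both are assumptions, as are the function names and toy vectors.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_unique_translation(
    context_vec: Sequence[float],
    candidate_vecs: Dict[str, Sequence[float]],
) -> str:
    """Return the candidate whose vector is most similar to the context vector.
    Raises ValueError if there are no candidates."""
    return max(candidate_vecs, key=lambda t: cosine(candidate_vecs[t], context_vec))


# Hypothetical 3-dimensional vectors, for illustration only.
context_vector = [0.9, 0.1, 0.0]
candidate_vectors = {"细胞": [1.0, 0.0, 0.1], "电池": [0.0, 1.0, 0.2]}
chosen = pick_unique_translation(context_vector, candidate_vectors)
print(chosen)
```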
In an alternative embodiment of the application, the co-occurrence relationship comprises: a first order co-occurrence relationship and a second order co-occurrence relationship;
the acquisition method of the potential hidden associated word pair comprises the following steps:
obtaining first-order co-occurrence word pairs according to first-order co-occurrence relations among keywords in the first language keyword set and the target translation keyword set;
obtaining a second-order co-occurrence word pair according to the second-order co-occurrence relation between each keyword in the first language keyword set and the target translation keyword set;
and obtaining potential hidden associated word pairs according to the first-order co-occurrence word pairs, the second-order co-occurrence word pairs and the language attribute of each keyword in each word pair.
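The two kinds of co-occurrence word pair can be sketched as follows. This passage does not define the co-occurrence relations precisely, so the sketch assumes that a first-order pair is two keywords that appear in the same document, and that a second-order pair is two keywords that never appear together but share at least one first-order co-occurrence word. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set


def first_order_pairs(docs: Iterable[Set[str]]) -> Set[FrozenSet[str]]:
    """Pairs of keywords that occur together in at least one document."""
    pairs: Set[FrozenSet[str]] = set()
    for keywords in docs:
        for a, b in combinations(sorted(keywords), 2):
            pairs.add(frozenset((a, b)))
    return pairs


def second_order_pairs(first: Set[FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """Pairs that share a first-order neighbour but never co-occur directly."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for pair in first:
        a, b = tuple(pair)
        neighbours[a].add(b)
        neighbours[b].add(a)
    second: Set[FrozenSet[str]] = set()
    for linked in neighbours.values():
        for a, c in combinations(sorted(linked), 2):
            pair = frozenset((a, c))
            if pair not in first:
                second.add(pair)
    return second


# Each document is represented by its (already translated) keyword set.
toy_docs = [{"k1", "k2"}, {"k2", "k3"}, {"k3", "k4"}]
first = first_order_pairs(toy_docs)
second = second_order_pairs(first)
print(sorted(sorted(p) for p in second))
```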
In an alternative embodiment of the present application, obtaining potential hidden associated word pairs according to the first-order co-occurrence word pairs, the second-order co-occurrence word pairs and the language attribute of each keyword in each word pair specifically includes:
acquiring a keyword association path corresponding to each keyword according to the first-order co-occurrence word pairs and the second-order co-occurrence word pairs; the keyword association path comprises the keyword, a first-order co-occurrence word corresponding to the keyword, a second first-order co-occurrence word and a first second-order co-occurrence word; the second first-order co-occurrence word and the first second-order co-occurrence word form a first-order co-occurrence word pair;
acquiring a target keyword association path according to the language attribute of each keyword in each keyword association path;
and for each target keyword association path, acquiring potential hidden association word pairs according to the first-order co-occurrence word and the first second-order co-occurrence word.
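One possible reading of the keyword association path is sketched below. It is an interpretation, not a definitive implementation: the sketch assumes that the second first-order co-occurrence word is another direct neighbour of the keyword, that the first second-order co-occurrence word is reached through it, and that the two ends of a candidate pair must not co-occur directly. The `AssociationPath` type and `build_paths` function are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Set


class AssociationPath(NamedTuple):
    keyword: str
    first_order_word: str          # co-occurs directly with the keyword
    second_first_order_word: str   # another direct neighbour of the keyword
    first_second_order_word: str   # reached via second_first_order_word


def build_paths(neighbours: Dict[str, Set[str]]) -> List[AssociationPath]:
    """Enumerate association paths from a first-order co-occurrence graph
    given as a symmetric adjacency mapping."""
    paths: List[AssociationPath] = []
    for k in sorted(neighbours):
        direct = neighbours[k]
        for a in sorted(direct):
            for b in sorted(direct):
                if b == a:
                    continue
                for c in sorted(neighbours.get(b, ())):
                    # c must be second-order with respect to k, and must not
                    # already co-occur with a (assumption of this sketch).
                    if c == k or c in direct or c in neighbours.get(a, ()):
                        continue
                    paths.append(AssociationPath(k, a, b, c))
    return paths


toy_graph = {"K": {"A", "B"}, "A": {"K"}, "B": {"K", "C"}, "C": {"B"}}
toy_paths = build_paths(toy_graph)
# Candidate pair for each path: (first_order_word, first_second_order_word).
print([(p.first_order_word, p.first_second_order_word) for p in toy_paths])
```

Without language attribute screening, the enumeration also yields paths anchored on other keywords (here one anchored on "B"), which is why the screening step that follows in the text matters.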
In an optional embodiment of the present application, obtaining a target keyword association path according to a language attribute of each keyword in each keyword association path specifically includes:
according to the language attribute of each keyword in each keyword association path, screening out the keyword association paths meeting the preset language attribute screening rule in the keyword association paths as target keyword association paths;
the preset language attribute screening rule comprises the following steps:
determining, according to the language attribute of the keyword, that the keyword is derived from both the first language technical literature data set and the second language technical literature data set;
determining, according to the language attribute of the first-order co-occurrence word and the language attribute of the first second-order co-occurrence word, that the first-order co-occurrence word and the first second-order co-occurrence word are derived only from the first language technical literature data set and only from the second language technical literature data set, respectively.
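The screening rule can be sketched as a predicate over a path. In this hedged example the language attribute of a keyword is modelled as the set of data sets it was extracted from; that representation, the `Path` type and the labels `L1` and `L2` are assumptions. The rule is implemented literally as stated; whether the mirrored case (ends swapped between languages) is also accepted is not stated in this passage.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

L1, L2 = "first", "second"  # labels for the two literature data sets


@dataclass(frozen=True)
class Path:
    keyword: str
    first_order_word: str
    second_first_order_word: str
    first_second_order_word: str


def is_target_path(path: Path, lang_attr: Dict[str, FrozenSet[str]]) -> bool:
    """Apply the language attribute screening rule to one association path."""
    # The bridging keyword must come from both data sets.
    bridge_ok = lang_attr[path.keyword] == {L1, L2}
    # The two ends must come only from the first and only from the second data set.
    ends_ok = (
        lang_attr[path.first_order_word] == {L1}
        and lang_attr[path.first_second_order_word] == {L2}
    )
    return bridge_ok and ends_ok


def candidate_pairs(
    paths: List[Path], lang_attr: Dict[str, FrozenSet[str]]
) -> List[Tuple[str, str]]:
    """Potential hidden associated word pairs taken from the target paths."""
    return [
        (p.first_order_word, p.first_second_order_word)
        for p in paths
        if is_target_path(p, lang_attr)
    ]


toy_attr = {
    "K": frozenset({L1, L2}),
    "A": frozenset({L1}),
    "B": frozenset({L1, L2}),
    "C": frozenset({L2}),
    "D": frozenset({L1}),
}
toy_paths = [Path("K", "A", "B", "C"), Path("K", "A", "B", "D"), Path("A", "K", "B", "C")]
pairs = candidate_pairs(toy_paths, toy_attr)
print(pairs)
```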
In an optional embodiment of the present application, the method for obtaining the association degree of the words in a potential hidden associated word pair includes:
obtaining the first-order co-occurrence probability of the keyword and the first-order co-occurrence word, and obtaining the second-order co-occurrence probability of the keyword and the first second-order co-occurrence word;
and obtaining the association degree of the words in the potential hidden associated word pair according to the product of the first-order co-occurrence probability and the second-order co-occurrence probability.
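The product rule, followed by screening on the association degree, can be sketched as below. How the two co-occurrence probabilities are estimated is not specified in this passage, so they are taken as given inputs; the threshold-based screening criterion, the function names and the numbers are likewise assumptions for illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def association_degree(p_first_order: float, p_second_order: float) -> float:
    """Association degree as the product of the two co-occurrence probabilities."""
    return p_first_order * p_second_order


def screen_pairs(degrees: Dict[Pair, float], threshold: float) -> List[Tuple[Pair, float]]:
    """Keep pairs whose degree reaches the threshold, strongest first.
    The threshold criterion is an assumption of this sketch."""
    kept = [(pair, d) for pair, d in degrees.items() if d >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical (first-order, second-order) probabilities per candidate pair.
toy_probabilities = {("A1", "C1"): (0.5, 0.4), ("A2", "C2"): (0.1, 0.2)}
toy_degrees = {
    pair: association_degree(p1, p2) for pair, (p1, p2) in toy_probabilities.items()
}
targets = screen_pairs(toy_degrees, threshold=0.1)
print(targets)
```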
In a second aspect, an embodiment of the present application provides a cross-language implicit association knowledge discovery apparatus, including:
the keyword set acquisition module is used for extracting keywords from the first language technical literature data set to obtain a first language keyword set, and extracting keywords from the second language technical literature data set to obtain a second language keyword set;
the keyword set translation module is used for carrying out first language translation and word sense disambiguation on each second language keyword in the second language keyword set to obtain a target translation keyword set;
the associated word pair acquisition module is used for acquiring at least one potential hidden associated word pair, and the association degree of the words in each potential hidden associated word pair, according to the co-occurrence relation between the keywords in the first language keyword set and the target translation keyword set and the language attribute of each keyword; wherein the two keywords in each potential hidden associated word pair come only from the first language technical literature data set and only from the second language technical literature data set, respectively; the language attribute represents the language type of the technical literature data set from which a keyword is derived;
and the association word pair screening module is used for screening each potential hidden association word pair according to each association degree to obtain a target hidden association word pair.
In an alternative embodiment of the present application, the keyword set translation module is specifically configured to:
translating each second language keyword in the second language keyword set into the first language according to a bilingual scientific and technical dictionary for the first language and the second language, to obtain an initial translation keyword set; wherein each second language keyword corresponds to at least one translation keyword in the initial translation keyword set;
filtering out of the initial translation keyword set the translation keywords that are not in the first language keyword set, to obtain an intermediate translation keyword set;
and screening out, from the intermediate translation keyword set, the unique translation keyword corresponding to each second language keyword, to obtain the target translation keywords.
In an alternative embodiment of the present application, the keyword set translation module is specifically configured to:
for each second language keyword in the second language keyword set, acquiring a context word vector corresponding to the second language keyword according to the text containing the second language keyword in the second language technical literature data set;
acquiring a word vector for each translation keyword corresponding to the second language keyword in the intermediate translation keyword set;
obtaining, according to the word vector of each translation keyword and the context word vector, the similarity between the word vector of each translation keyword and the context word vector corresponding to the second language keyword;
and selecting, according to the similarities, the translation keyword with the maximum similarity as the unique translation keyword.
In an alternative embodiment of the application, the co-occurrence relationship comprises: a first order co-occurrence relationship and a second order co-occurrence relationship;
the related word pair acquisition module is specifically used for:
obtaining first-order co-occurrence word pairs according to first-order co-occurrence relations among keywords in the first language keyword set and the target translation keyword set;
obtaining a second-order co-occurrence word pair according to the second-order co-occurrence relation between each keyword in the first language keyword set and the target translation keyword set;
and obtaining potential hidden associated word pairs according to the first-order co-occurrence word pairs, the second-order co-occurrence word pairs and the language attribute of each keyword in each word pair.
In an alternative embodiment of the present application, the related word pair obtaining module is specifically configured to:
acquiring a keyword association path corresponding to each keyword according to the first-order co-occurrence word pairs and the second-order co-occurrence word pairs; the keyword association path comprises the keyword, a first-order co-occurrence word corresponding to the keyword, a second first-order co-occurrence word and a first second-order co-occurrence word; the second first-order co-occurrence word and the first second-order co-occurrence word form a first-order co-occurrence word pair;
acquiring a target keyword association path according to the language attribute of each keyword in each keyword association path;
and for each target keyword association path, acquiring potential hidden association word pairs according to the first-order co-occurrence word and the first second-order co-occurrence word.
In an alternative embodiment of the present application, the related word pair obtaining module is specifically configured to:
according to the language attribute of each keyword in each keyword association path, screening out the keyword association paths meeting the preset language attribute screening rule in the keyword association paths as target keyword association paths;
the preset language attribute screening rule comprises the following steps:
determining, according to the language attribute of the keyword, that the keyword is derived from both the first language technical literature data set and the second language technical literature data set;
determining, according to the language attribute of the first-order co-occurrence word and the language attribute of the first second-order co-occurrence word, that the first-order co-occurrence word and the first second-order co-occurrence word are derived only from the first language technical literature data set and only from the second language technical literature data set, respectively.
In an alternative embodiment of the present application, the related word pair obtaining module is specifically configured to:
obtaining the first-order co-occurrence probability of the keyword and the first-order co-occurrence word, and obtaining the second-order co-occurrence probability of the keyword and the first second-order co-occurrence word;
and obtaining the association degree of the words in the potential hidden associated word pair according to the product of the first-order co-occurrence probability and the second-order co-occurrence probability.
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a memory, a processor, and a computer program stored on the memory, and the processor executes the computer program to implement the steps of the cross-language implicit association knowledge discovery method provided in any one of the foregoing embodiments.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the cross-language implicit associated knowledge discovery method provided in any one of the above embodiments.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
In this scheme, the second language keyword set is converted into the first language through first language translation and word sense disambiguation to obtain the target translation keyword set. This ensures that each target translation keyword is the unique first language translation of its second language keyword, which resolves keyword ambiguity and the problem of one word having multiple translations across languages. By combining the co-occurrence relations among the keywords with the language attributes of the keywords, the target hidden associated word pairs are obtained through dual screening, by language attribute and by association degree. Deep knowledge associations across languages that have not yet been noticed can thus be obtained, word pairs with implicit associations across languages can be extracted more accurately, and the reliability of the implicit associations acquired between knowledge is effectively improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic flow chart of a cross-language implicit associated knowledge discovery method provided by an embodiment of the application;
FIG. 2 is a schematic diagram of an English keyword translation and disambiguation flow provided by an embodiment of the application;
FIG. 3 is a schematic diagram of a target keyword association path type according to an embodiment of the present application;
FIG. 4 is a flowchart of a cross-language implicit associated knowledge discovery method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a data processing flow according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a cross-language implicit associated knowledge discovery device according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device for cross-language implicit associated knowledge discovery according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and do not limit those technical solutions.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present specification. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term; e.g., "A and/or B" may be implemented as "A", as "B", or as "A and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The terminology and related art relevant to the application are described below:
The knowledge discovery model of early LBD is the ABC model: when two knowledge entities A and C that are not directly related are both associated with a knowledge entity B, an implicit association between A and C can be inferred.
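The ABC model can be sketched as a search over direct associations. The following is a minimal illustration, assuming associations are undirected links between entity names; the function name is hypothetical, and the toy links are only a familiar textbook-style example, not data from the application.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def abc_candidates(links: Iterable[Tuple[str, str]], a: str) -> Dict[str, Set[str]]:
    """For entity A, return each C that is not directly linked to A but
    shares at least one intermediate B, mapped to the set of such Bs."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for x, y in links:
        neighbours[x].add(y)
        neighbours[y].add(x)
    result: Dict[str, Set[str]] = defaultdict(set)
    for b in neighbours[a]:
        for c in neighbours[b]:
            if c != a and c not in neighbours[a]:
                result[c].add(b)
    return dict(result)


toy_links = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
    ("fish oil", "platelet aggregation"),
    ("platelet aggregation", "Raynaud's disease"),
]
hidden = abc_candidates(toy_links, "fish oil")
print(hidden)
```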
As data volumes grow and knowledge associations become more complex, simple linear reasoning can hardly meet the needs of knowledge discovery in a complex knowledge network. Experts therefore suggested no longer treating B as a single node and replacing the one-way A-B-C search with a two-way search from A and C, and the AnC model was proposed by expanding node B.
The AnC model holds that B is not a single knowledge entity but a knowledge chain composed of multiple knowledge entities, so that A-B-C is expanded into the form A-(B1-B2-B3-…-Bn)-C. The AnC model can be combined with graph theory to better meet the needs of knowledge discovery in a complex knowledge network, and is widely applied in LBD research.
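The chain form A-(B1-…-Bn)-C can be illustrated as a bounded path search in a graph. The sketch below enumerates simple paths from A to C with between one and `max_b` intermediate entities; the depth-first strategy, the bound and the names are assumptions for illustration and are not taken from the application.

```python
from typing import Dict, List, Set


def anc_chains(graph: Dict[str, Set[str]], a: str, c: str, max_b: int) -> List[List[str]]:
    """All simple paths a - B1 - ... - Bn - c with 1 <= n <= max_b."""
    found: List[List[str]] = []

    def extend(path: List[str]) -> None:
        n_intermediate = len(path) - 1  # entities already placed after a
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt in path:
                continue  # keep the path simple (no repeated entity)
            if nxt == c:
                if n_intermediate >= 1:  # skip a direct a - c link
                    found.append(path + [c])
            elif n_intermediate < max_b:
                extend(path + [nxt])

    extend([a])
    return found


toy_graph = {
    "A": {"B1"},
    "B1": {"A", "B2", "C"},
    "B2": {"B1", "C"},
    "C": {"B1", "B2"},
}
chains = anc_chains(toy_graph, "A", "C", max_b=2)
print(chains)
```

With `max_b=1` the search reduces to the ABC case, returning only A-B1-C for this toy graph.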
The related technology of the LBD method can be divided into four categories: methods based on co-occurrence analysis, methods based on semantic analysis, methods based on knowledge graphs, and methods that fuse artificial intelligence technology. The specific schemes and existing defects of the related technology are described in detail below:
The co-occurrence analysis-based method discovers potential association between knowledge entities by analyzing the co-occurrence relationship between the knowledge entities, and has the advantages of simplicity, intuitiveness and easiness in calculation.
However, the method based on co-occurrence analysis relies on a keyword co-occurrence matrix, which is a sparse matrix. When the analyzed data set becomes large, the resulting sparse matrix is huge, and its computational cost is high or even prohibitive. Current methods based on co-occurrence analysis are therefore only suitable for relatively simple knowledge discovery on small-scale data sets.
Semantic analysis based method: the semantic relations among knowledge entities are further defined by applying semantic filtering, knowledge discovery rule base construction, semantic vector construction and similar means to candidate knowledge items, so that the associations between knowledge are revealed accurately and comprehensively.
Knowledge graph based method: knowledge discovery is carried out by using the rich entities in the knowledge graph and the semantic relations among them, by means of graph mining algorithms such as path mining, subgraph mining and link prediction.
Method fusing artificial intelligence technology: knowledge entities are automatically mined from unstructured text by means of machine learning, deep learning and other technologies, and the quality of knowledge discovery can be improved by combining these with methods such as path mining and link prediction.
With the rapid development of new-generation artificial intelligence technology represented by deep learning, data-driven knowledge discovery that combines artificial intelligence with knowledge graphs has become a research hotspot for LBD. In practical applications, however, a suitable technical method must still be selected according to the specific application scenario and requirements.
LBD methods based on semantic analysis and on knowledge graphs depend on ontologies, expert resources or knowledge graphs, and methods that fuse artificial intelligence depend on large amounts of high-quality labeled data. Bilingual ontologies, expert resources, knowledge graphs and labeled data covering all scientific and technical fields are currently difficult to acquire.
Therefore, although the LBD method can in theory be applied to any discipline, most LBD work is oriented to specific fields, such as the biomedical field and the library and information science field. In other fields its knowledge discovery performance is poor, owing to the available data, tool software and the maturity of informatics within the discipline.
In addition, most of the LBD technology at present focuses on single language knowledge literature resources, and research and development of cross-language knowledge discovery related technology from different languages are rarely carried out. The problem of unbalanced knowledge distribution exists among different languages, and how to enable knowledge to cross the gap of the languages and promote the cross-language knowledge fusion and sharing is a problem to be solved urgently.
To address at least one of the above technical problems or areas needing improvement in the related technology, the application provides a cross-language implicit associated knowledge discovery method.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
Fig. 1 is a schematic flow chart of a cross-language implicit associated knowledge discovery method provided by an embodiment of the present application, and as shown in fig. 1, the embodiment of the present application provides a cross-language implicit associated knowledge discovery method, including:
Step S101, keyword extraction is performed on the first language technical literature data set to obtain a first language keyword set, and keyword extraction is performed on the second language technical literature data set to obtain a second language keyword set.
A language is a means of human communication; examples include Chinese, English, Spanish, Russian, Arabic, and French. A technical literature data set comprises texts corresponding to published scientific and technological achievements, such as paper texts, patent texts, fund project texts, monograph texts, and research report texts.
Specifically, the specific first language and second language (such as Chinese and English) for which cross-language implicit associated knowledge is to be discovered in this embodiment are determined, and a first language technical literature data set and a second language technical literature data set are obtained. The terms "first" and "second" are used only to distinguish the languages and carry no other meaning.
Keyword extraction is carried out on the first language technical literature data set to obtain a first language keyword set, and keyword extraction is carried out on the second language technical literature data set to obtain a second language keyword set.
If meaningless keywords are extracted during keyword extraction, the knowledge discovery results may be inaccurate or difficult to understand. To ensure that the extracted keywords are meaningful scientific and technological words (namely, scientific and technological terms), and thereby ensure the accuracy and interpretability of the knowledge discovery result, keywords may, for example, be extracted from the title and abstract of each text in the first language technical literature data set by using a first language technical dictionary.
Specifically, a first language technical dictionary is constructed and used to segment the title and the abstract sentences of each text in the first language technical literature data set. All possible segmentation positions are obtained, and a directed acyclic graph (DAG) is constructed from them. A dynamic programming (DP) algorithm then searches the DAG for the maximum-probability path, which gives the most probable segmentation based on word frequency, and the sentences are output as sequences of words. Finally, a co-occurrence relation graph of the words is constructed, and keywords are extracted from the segmented sentences based on the TextRank algorithm (a graph-based ranking algorithm for text).
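The dictionary-based segmentation step can be illustrated with a minimal Python sketch. It is not the implementation of this embodiment: the function name `segment`, the toy dictionary `FREQ`, its frequency values, and the fallback count of 1 for single characters missing from the dictionary are all assumptions made for illustration.

```python
import math

# Hypothetical technical dictionary: word -> frequency (toy values).
FREQ = {"知识": 50, "发现": 40, "知识发现": 30, "方法": 60}

def segment(sentence, freq):
    """Split `sentence` into the word sequence with the highest total log-probability."""
    total = sum(freq.values())
    n = len(sentence)
    max_len = max(len(w) for w in freq)
    # DAG: for each start index, the end indices of dictionary words starting there.
    # A single character is always allowed so that every position can be reached.
    dag = {}
    for i in range(n):
        ends = [j for j in range(i + 1, min(n, i + max_len) + 1)
                if sentence[i:j] in freq]
        if i + 1 not in ends:
            ends.append(i + 1)
        dag[i] = ends
    # Dynamic programming from right to left:
    # best[i] = (best log-probability of sentence[i:], end index of its first word).
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(freq.get(sentence[i:j], 1) / total) + best[j][0], j)
            for j in dag[i]
        )
    # Follow the stored end indices to read off the maximum-probability path.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(segment("知识发现方法", FREQ))
```

With these toy frequencies the single dictionary word "知识发现" scores higher than the two-word split "知识" + "发现", so the longer term is kept as one unit.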
It is understood that, in addition to the TextRank algorithm, keyword extraction algorithms such as the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the LDA (Latent Dirichlet Allocation) algorithm may be used for keyword extraction.
In addition, the second language keywords may be obtained with the same keyword extraction method as the first language keywords; the specific steps are not repeated here.
Step S102, performing first language translation and word sense disambiguation processing on each second language keyword in the second language keyword set to obtain a target translation keyword set.
Specifically, the second language keywords and the first language keywords are mapped to the same semantic space (first language) by performing translation and word sense disambiguation processing on the second language keywords.
Each second language keyword in the second language keyword set is translated into the first language, so that vocabulary in the first language and vocabulary in the second language are associated by means of translation. Because each second language keyword may have several corresponding first language translations, word sense disambiguation is also needed to resolve the one-word-multiple-translations problem. This ensures that each second language keyword obtains a unique target translation keyword, and the target translation keyword set is formed from the target translation keywords corresponding to the second language keywords.
Step S103, obtaining at least one potential hidden associated word pair and the association degree of words in each potential hidden associated word pair according to the co-occurrence relation between the keywords in the first language keyword set and the target translation keyword set and the language attribute of each keyword; wherein, the two keywords in the hidden associated word pair are only from the first language technical literature data set and the second language technical literature data set respectively; the language attribute characterizes the language type of the scientific literature dataset from which the keyword originated.
Specifically, keyword co-occurrence (i.e., first order co-occurrence of keywords) refers to two keywords appearing in the same scientific literature (or a piece of text of a preset length). Any keyword and its first-order co-occurrence word can form a first-order co-occurrence word pair, and the frequency of co-occurrence of the keyword and the first-order co-occurrence word is the co-occurrence frequency.
Generally, when two keywords co-occur, they appear in the same sentence, and a contextual relationship exists between them. For example, the two keywords "apple" and "mango" co-occur in the sentence "apple and mango are both fruit varieties with excellent mouthfeel, and they have rich nutritional value". In this sentence, "apple" and "mango" both belong to "fruit" and stand in a parallel relationship. In this case, "mango" may be referred to as a first-order co-occurrence word of "apple".
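As a minimal sketch of how first-order co-occurrence frequencies could be counted, the following Python snippet treats each document as one co-occurrence window. The function name `first_order_cooccurrence` and the toy keyword lists are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

def first_order_cooccurrence(doc_keywords):
    """Count, for every unordered keyword pair, the number of documents containing both."""
    counts = Counter()
    for keywords in doc_keywords:
        # Sort so that each unordered pair is always stored under the same key.
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts

# Toy input: one keyword list per document.
docs = [["apple", "mango", "fruit"], ["apple", "fruit"], ["mango", "vitamin"]]
cooc = first_order_cooccurrence(docs)
print(cooc[("apple", "fruit")])    # co-occurrence frequency of "apple" and "fruit"
print(cooc[("apple", "vitamin")])  # pairs that never co-occur have frequency 0
```

Storing only the pairs that actually occur, rather than a full keyword-by-keyword matrix, keeps memory proportional to the number of non-zero entries.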
In order to further realize deeper implicit associated knowledge discovery, the co-occurrence relationship between the keywords comprises an n-order co-occurrence relationship of the keywords, wherein n is a positive integer greater than or equal to 1.
Taking the second-order co-occurrence of a keyword as an example, for any keyword, a second-order co-occurrence word is a first-order co-occurrence word of one of the keyword's first-order co-occurrence words, where no first-order co-occurrence relationship exists between the keyword and the second-order co-occurrence word (i.e., their co-occurrence frequency is 0). It can be understood that higher-order co-occurrence of keywords can be obtained by analogy with this second-order determination method, which is not described further here.
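A short Python sketch of the second-order case follows. It assumes the reading that a second-order co-occurrence word is a first-order co-occurrence word of a first-order co-occurrence word that never co-occurs directly with the keyword; the helper names `build_neighbors` and `second_order_words` and the toy pairs are hypothetical.

```python
from collections import defaultdict

def build_neighbors(pairs):
    """Map each keyword to the set of its first-order co-occurrence words."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)

def second_order_words(keyword, neighbors):
    """Neighbors of neighbors that have no direct co-occurrence with `keyword`."""
    first = neighbors.get(keyword, set())
    reachable = set()
    for word in first:
        reachable |= neighbors.get(word, set())
    # Drop the keyword itself and every word that co-occurs with it directly.
    return reachable - first - {keyword}

# Toy first-order co-occurrence word pairs.
pairs = [("apple", "fruit"), ("apple", "mango"), ("mango", "fruit"), ("fruit", "vitamin")]
neighbors = build_neighbors(pairs)
print(second_order_words("apple", neighbors))
```

Here "vitamin" is reached from "apple" only through "fruit", so it is the only second-order co-occurrence word of "apple".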
It should be noted that, the co-occurrence relationship between the keywords in the target translation keyword set is determined according to the co-occurrence relationship between the keywords in the second language keyword set.
The language attribute characterizes the language type of the scientific literature dataset from which the keyword originated.
The method can extract the keywords and simultaneously label the language attribute of the keywords, and determine the language attribute according to the language attribute label.
For example, a first language keyword is labeled L1 (the technical literature data set from which the keyword is derived is in the first language), a second language keyword is labeled L2 (the technical literature data set from which the keyword is derived is in the second language), and a target translation keyword is likewise labeled L2. The label of any keyword that appears in both the first language keyword set and the target translation keyword set is replaced with BOTH (the keyword is derived from technical literature data sets in both the first language and the second language).
Alternatively, after the first language keyword set, the second language keyword set, and the target translation keyword set are acquired, the keywords may be labeled according to the differences between the elements of the sets, and the language attribute is determined according to the language attribute label.
For example, an intersection of the first language keyword set and the target translation keyword set is obtained, an independent first language word set is obtained according to a difference between the first language keyword set and the intersection, and an independent second language word set is obtained according to a difference between the target translation keyword set and the intersection. BOTH is labeled for each keyword in the intersection, L1 is labeled for each keyword in the separate first language word set, and L2 is labeled for each keyword in the separate second language word set.
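The set-based labeling described above can be sketched in a few lines of Python. The function name `label_language_attributes` and the sample keywords are assumptions for illustration.

```python
def label_language_attributes(first_keywords, target_translations):
    """Label each keyword L1, L2, or BOTH according to the set it belongs to."""
    first_keywords = set(first_keywords)
    target_translations = set(target_translations)
    shared = first_keywords & target_translations  # intersection of the two sets
    labels = {kw: "BOTH" for kw in shared}
    labels.update({kw: "L1" for kw in first_keywords - shared})
    labels.update({kw: "L2" for kw in target_translations - shared})
    return labels

# Toy keyword sets; the target translations are already in the first language.
labels = label_language_attributes(
    ["knowledge graph", "link prediction", "word vector"],
    ["word vector", "path mining"],
)
print(labels)
```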
After the co-occurrence relation and the language attribute of the keywords are determined, at least one potential hidden associated word pair is obtained according to the co-occurrence relation between the keywords in the first language keyword set and the target translation keyword set and the language attribute of each keyword.
It can be appreciated that, in order to implement cross-language implicit association discovery, it is necessary to determine a potential implicit association word pair, and ensure that two keywords in the potential implicit association word pair are derived from only the first language technical literature data set and the second language technical literature data set, respectively.
After at least one potential hidden associated word pair is determined, the association degree of the words in each potential hidden associated word pair is obtained. It will be appreciated that the degree of association between the two words of a potential hidden associated word pair in the technical literature (of both the first language and the second language) can be determined by statistical measures such as mutual information (MI) and pointwise mutual information (PMI).
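For reference, pointwise mutual information is defined as PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The Python sketch below estimates it from document counts; the function name `pmi`, the choice of base-2 logarithm, and the use of document frequencies as probability estimates are assumptions, not details given in this embodiment.

```python
import math

def pmi(pair_count, count_x, count_y, n_docs):
    """Pointwise mutual information estimated from document frequencies.

    pair_count: documents containing both keywords
    count_x, count_y: documents containing each keyword
    n_docs: total number of documents
    """
    if pair_count == 0:
        return float("-inf")  # the two keywords never co-occur
    # p(x, y) / (p(x) * p(y)) simplifies to pair_count * n_docs / (count_x * count_y).
    return math.log2(pair_count * n_docs / (count_x * count_y))

print(pmi(pair_count=2, count_x=4, count_y=4, n_docs=16))
```

A positive value means the two keywords co-occur more often than would be expected if they were independent.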
For example, the candidate associated word pairs can be formed by using the acquired keywords and third-order co-occurrence words of the keywords, and the candidate associated word pairs are screened through language attributes to obtain potential hidden associated word pairs. For any potential hidden associated word pair, a corresponding third-order co-occurrence path (a keyword C1-keyword C2-keyword C3-keyword C4) is determined, wherein the keyword C1 and the keyword C4 in the path form the potential hidden associated word pair, the keyword C1 and the keyword C2 form a first-order co-occurrence word pair, the keyword C2 and the keyword C3 form a first-order co-occurrence word pair, and the keyword C3 and the keyword C4 form the first-order co-occurrence word pair. Mutual information of each first-order co-occurrence word pair is acquired, the transitivity of the mutual information is used, namely the association degree between the keyword C1 and the keyword C4 is deduced through the keyword C2 and the keyword C3 as bridges, and the conditional mutual information (Conditional Mutual Information, CMI) between the keyword C1 and the keyword C4 is calculated to serve as the association degree of the potential implicit association word pair.
It should be noted that, according to actual requirements, other methods may be adopted to obtain the potential hidden associated word pairs and their association degrees.
Step S104, each potential hidden associated word pair is screened according to each association degree, and the target hidden associated word pair is obtained.
Specifically, because not every obtained potential hidden associated word pair has a strong association between its words, the potential hidden associated word pairs also need to be screened according to their association degrees to obtain the target hidden associated word pairs.
For example, the potential hidden associated word pairs may be ranked in order of the degree of association from high to low according to the degree of association corresponding to the potential hidden associated word pairs, and the first k potential hidden associated word pairs are selected as the target hidden associated word pairs.
Alternatively, the potential hidden associated word pairs whose association degree is greater than a preset threshold are selected as the target hidden associated word pairs.
It can be appreciated that, because the potential hidden associated word pairs are obtained and screened from the processed keyword sets, the keywords in the resulting target hidden associated word pairs are keywords from the first language keyword set and the target translation keyword set.
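Both screening strategies, top-k selection and thresholding, can be sketched as follows. The function names and the toy association degrees are hypothetical.

```python
def top_k_pairs(scored_pairs, k):
    """Keep the k word pairs with the highest association degree."""
    return sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:k]

def pairs_above_threshold(scored_pairs, threshold):
    """Keep the word pairs whose association degree exceeds the threshold."""
    return [item for item in scored_pairs if item[1] > threshold]

# Toy input: (word pair, association degree).
scored = [(("a", "x"), 0.2), (("b", "y"), 0.9), (("c", "z"), 0.5)]
print(top_k_pairs(scored, 2))
print(pairs_above_threshold(scored, 0.4))
```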
According to the technical scheme provided by the embodiment, through the first language translation and word sense disambiguation processing of the second language keyword set, the second language keyword is converted into the first language to obtain the target translation keyword set, each target translation keyword in the target translation keyword set can be ensured to be the only first language translation corresponding to the second language keyword, and the problems of ambiguity and one-word multiple translation of keywords among different languages are solved. The co-occurrence relation among the keywords and the language attribute of the keywords are combined, and the target implicit association word pairs are obtained in a mode of language attribute screening and association degree dual screening, so that deep knowledge association which is not yet perceived in different languages can be obtained, word pairs with implicit association in different languages can be extracted more accurately, and the reliability of the implicit association among the obtained knowledge is improved effectively.
Further, the technical scheme provided by the embodiment can be applied to any subject field and subject knowledge discovery service, the knowledge transfer among the technical fields and languages is tracked, the fusion and sharing of cross-language knowledge are promoted, and the cross innovation among different fields and different languages is effectively supported.
In an optional embodiment of the present application, performing first language translation and word sense disambiguation processing on each second language keyword in the second language keyword set to obtain a target translation keyword set specifically includes:
translating each second language keyword in the second language keyword set into the first language according to the bilingual scientific dictionary corresponding to the first language and the second language, and obtaining an initial translated text keyword set; wherein each second language keyword corresponds to at least one translation keyword in the initial translation keyword set;
filtering translation keywords which are not in the first language keyword set in the initial translation keyword set to obtain an intermediate translation keyword set;
and screening out unique translation keywords corresponding to each second language keyword in the intermediate translation keyword set to obtain target translation keywords.
Specifically, taking Chinese as the first language and English as the second language as an example, the following illustrates the specific steps of performing first language translation and word sense disambiguation on each second language keyword in the second language keyword set to obtain the target translation keyword set in this embodiment.
To achieve cross-language knowledge discovery, links and mappings are required for the first language keywords and the second language keywords. Fig. 2 is a schematic diagram of an english keyword translation and disambiguation flow provided in an embodiment of the application, as shown in fig. 2, using a bilingual technical dictionary (i.e., a chinese-english technical dictionary) corresponding to chinese and english as a basic resource, translating each english keyword in an english keyword set into chinese by means of translation, and obtaining a translated keyword corresponding to each english keyword, thereby obtaining an initial translated keyword set.
Because each English keyword corresponds to at least one translation keyword, disambiguation treatment is needed when the English keyword corresponds to a plurality of translation keywords, and each English keyword is ensured to correspond to a unique translation keyword.
Considering that scientific and technological literature writing of different languages has specific language habits, in order to preferentially select words conforming to the writing habits of the scientific and technological literature, the Chinese keyword set extracted from the Chinese technical literature (namely Chinese corpus) is utilized to filter the initial translated text keywords, namely, all translated text keywords which are not present in the Chinese keyword set are deleted, so that an intermediate translated text keyword set is obtained.
Because the condition that one English keyword corresponds to a plurality of translated keywords still exists in the intermediate translated keyword set obtained after the filtering processing, screening processing is further needed to be carried out on the intermediate translated keyword set, and unique translated keywords corresponding to each English keyword in the intermediate translated keyword set are screened out to obtain target translated keywords.
It can be appreciated that, during the screening process, the unique translation keywords corresponding to each english keyword can be determined by manual screening by an expert, screening according to standard translation widely accepted and approved in the industry, or screening according to context analysis.
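The dictionary translation and filtering stages can be sketched in Python as below. The function name `translate_and_filter`, the toy bilingual dictionary, and the toy Chinese keyword set are assumptions for illustration; the final choice among the remaining candidates is left to a separate screening step.

```python
def translate_and_filter(second_keywords, bilingual_dict, first_keywords):
    """Look up candidate translations, then keep only those found in the first language keyword set."""
    first_keywords = set(first_keywords)
    intermediate = {}
    for keyword in second_keywords:
        candidates = bilingual_dict.get(keyword, [])             # initial translations
        kept = [c for c in candidates if c in first_keywords]   # corpus-based filtering
        if kept:
            intermediate[keyword] = kept
    return intermediate

# Toy English-Chinese technical dictionary and Chinese keyword set.
bilingual_dict = {"cell": ["细胞", "电池", "单元"], "graph": ["图"], "foo": ["未知"]}
chinese_keywords = {"细胞", "电池", "图", "知识图谱"}
intermediate = translate_and_filter(["cell", "graph", "foo"], bilingual_dict, chinese_keywords)
print(intermediate)
```

In this toy run "cell" still has two candidate translations after filtering, so it would go on to the screening step, while "graph" already has a unique translation.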
According to the technical scheme provided by the embodiment, through the first language translation and word sense disambiguation processing of the second language keyword set, the first language keyword set is used for filtering the translated keywords, and further the unique translated keywords are screened, so that each target translated keyword in the target translated keyword set is the unique first language translation corresponding to the second language keyword, the problems of ambiguity and word multiple translation of keywords in different languages are solved, the consistency of keywords in the cross-language knowledge discovery process is ensured, the problems of ambiguity and confusion are solved, the probability of implicit association error matching between knowledge is reduced, and the reliability of implicit association between acquired knowledge is effectively improved.
In an alternative embodiment of the present application, screening out a unique translation keyword corresponding to each second language keyword in the intermediate translation keyword set specifically includes:
for each second language keyword in the second language keyword set, acquiring a context word vector corresponding to the second language keyword according to the text containing the second language keyword in the second language scientific literature data set;
acquiring a word vector of each translation keyword corresponding to the second language keyword in the intermediate translation keyword set;
According to the word vector and the context word vector of each translation keyword, obtaining the similarity between the word vector of each translation keyword and the context word vector corresponding to the second language keyword;
and screening out the translation keywords corresponding to the maximum similarity according to the similarities as unique translation keywords.
Specifically, referring to fig. 2 again, taking the first language as chinese and the second language as english as an example, the specific step of screening out the unique translation keyword corresponding to each second language keyword in the intermediate translation keyword set in this embodiment is illustrated.
For any English keyword in the English keyword set, the text containing the English keyword is extracted from the second language technical literature data set, and the context word vector corresponding to the English keyword is obtained from it. A word vector is also obtained for each translation keyword corresponding to the English keyword in the intermediate translation keyword set.
For example, by acquiring a bilingual pre-training model (i.e., a chinese-english pre-training model), a text containing the english keyword may be input into the bilingual pre-training model, and a context word vector corresponding to the english keyword may be output. For any translation keyword, word2vec (Word to vector) model is used to characterize the translation keyword as a Word vector.
In addition, the GloVe (Global Vectors) model, ELMo (Embedding from Language Models) model, fastText model, and the like can be used to obtain word vectors.
And for any second language keyword and a plurality of corresponding translated keywords, obtaining a context word vector corresponding to the second language keyword and a word vector of each translated keyword, calculating the similarity of the word vector of each translated keyword and the context word vector corresponding to the second language keyword, and screening out the translated keyword corresponding to the maximum similarity as a unique translated keyword according to each similarity.
It is understood that the similarity of the word vectors in this embodiment can be measured by a metric such as Euclidean distance, cosine distance, or Mahalanobis distance; when a distance is used, a smaller distance corresponds to a higher similarity.
It should be understood that the purpose of the context-based screening step is to screen out a unique translation keyword for each second language keyword. If a given second language keyword has only one corresponding translation keyword in the intermediate translation keyword set, that translation keyword may be taken directly as the target translation keyword without performing the screening step.
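The context-based screening can be sketched in Python with cosine similarity. The vectors here are toy values; in practice they would come from the pre-trained models mentioned above and would need to lie in a comparable vector space, which this sketch simply assumes. The function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pick_translation(context_vector, candidate_vectors):
    """Return the candidate translation whose vector is most similar to the context vector."""
    if len(candidate_vectors) == 1:
        # A unique candidate needs no screening.
        return next(iter(candidate_vectors))
    return max(
        candidate_vectors,
        key=lambda cand: cosine_similarity(context_vector, candidate_vectors[cand]),
    )

# Toy vectors: the context of "cell" in a biology text, and two candidate translations.
context = [0.9, 0.1, 0.0]
candidates = {"细胞": [0.8, 0.2, 0.1], "电池": [0.1, 0.9, 0.3]}
print(pick_translation(context, candidates))
```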
According to the technical scheme provided by the embodiment, through translating the second language keyword set through the first language, filtering the translated keywords through the first language keyword set, further screening the translated keywords based on the context, and realizing disambiguation of the translated keywords through the filtering and screening steps, each target translated keyword in the target translated keyword set can be guaranteed to be the only first language translation corresponding to the second language keyword, the ambiguity and one-word multi-translation problems of keywords among different languages are solved, consistency of keywords in the cross-language knowledge discovery process is guaranteed, the ambiguity and confusion problems are solved, the probability of implicit association error matching among knowledge is reduced, and the reliability of implicit association among acquired knowledge is effectively improved.
In an alternative embodiment of the application, the co-occurrence relationship comprises: a first order co-occurrence relationship and a second order co-occurrence relationship;
the acquisition method of the potential hidden associated word pair comprises the following steps:
obtaining first-order co-occurrence word pairs according to first-order co-occurrence relations among keywords in the first language keyword set and the target translation keyword set;
obtaining a second-order co-occurrence word pair according to the second-order co-occurrence relation between each keyword in the first language keyword set and the target translation keyword set;
And obtaining potential hidden associated word pairs according to the first-order co-occurrence word pairs, the second-order co-occurrence word pairs and the language attribute of each keyword in each word pair.
Specifically, the co-occurrence relationship includes: a first order co-occurrence relationship and a second order co-occurrence relationship. The embodiment realizes the acquisition of the potential hidden associated word pairs by means of the first-order co-occurrence relationship and the second-order co-occurrence relationship among the keywords and combining the language attribute of the keywords.
According to the co-occurrence condition of each keyword in the first language keyword set in each text in the first language technical literature data set and the co-occurrence condition of each keyword in the second language keyword set in each text in the second language technical literature data set, a first-order co-occurrence relation between each keyword in the first language keyword set and each keyword in the target translation keyword set is obtained, and a first-order co-occurrence word pair is obtained.
Based on all the first-order co-occurrence word pairs, the second-order co-occurrence relation between the keywords in the first language keyword set and the target translation keyword set is acquired, and the second-order co-occurrence word pairs are obtained.
And obtaining potential hidden associated word pairs according to the first-order co-occurrence word pairs, the second-order co-occurrence word pairs and the language attribute of each keyword in each word pair. It can be appreciated that the acquisition mode of the potential implicit related word pairs can be determined according to actual requirements.
For example, for any keyword A, a second-order co-occurrence word A1 derived only from the first language technical literature data set and a second-order co-occurrence word A2 derived only from the second language technical literature data set may be obtained according to the language attribute, and A1 and A2 form a potential hidden associated word pair.
Alternatively, for any keyword A, a second-order co-occurrence word A1 derived only from the first language technical literature data set and a first-order co-occurrence word A3 derived only from the second language technical literature data set are obtained according to the language attribute, and A1 and A3 form a potential hidden associated word pair.
Alternatively, for any keyword A, a first-order co-occurrence word A4 derived only from the first language technical literature data set and a first-order co-occurrence word A3 derived only from the second language technical literature data set are obtained according to the language attribute, and A3 and A4 form a potential hidden associated word pair.
According to the technical scheme provided by the embodiment, the method for acquiring the potential hidden associated word pairs by combining the first-order co-occurrence relation and the second-order co-occurrence relation of the keywords with the language attribute of the keywords only needs to consider the adjacent words related to any keyword, and compared with the co-occurrence relation of more orders, the first-order co-occurrence relation and the second-order co-occurrence relation are easier to acquire, so that the calculated amount can be effectively reduced, and the method is applicable to large-scale data processing. And the association between words can be captured better by adopting the low-order co-occurrence relationship, so that the accuracy of acquiring the word implicit association relationship is improved, and the reliability of the implicit association between acquired knowledge is effectively improved.
In an alternative embodiment of the present application, the obtaining a potential implicit related word pair according to the first order co-occurrence word pair, the second order co-occurrence word pair and the language attribute of each keyword in each word pair specifically includes:
acquiring a keyword association path corresponding to each keyword according to the first-order co-occurrence word pairs and the second-order co-occurrence word pairs; the keyword association path comprises the keyword, and a first first-order co-occurrence word, a second first-order co-occurrence word, and a first second-order co-occurrence word corresponding to the keyword; the second first-order co-occurrence word and the first second-order co-occurrence word form a first-order co-occurrence word pair;
acquiring a target keyword association path according to the language attribute of each keyword in each keyword association path;
and for each target keyword association path, acquiring a potential hidden associated word pair according to the first first-order co-occurrence word and the first second-order co-occurrence word.
Specifically, a keyword association path can be formed from any keyword S, its first-order co-occurrence words, and its second-order co-occurrence words. The keyword association path comprises the keyword S, a first first-order co-occurrence word W1 corresponding to the keyword S, a second first-order co-occurrence word W2, and a first second-order co-occurrence word W3; the second first-order co-occurrence word W2 and the first second-order co-occurrence word W3 form a first-order co-occurrence word pair. From all the first-order co-occurrence word pairs and second-order co-occurrence word pairs, a plurality of keyword association paths can be acquired.
Further, since the two keywords in a potential hidden associated word pair are derived from the first language technical literature data set and the second language technical literature data set respectively, the keyword association paths need to be screened according to the language attribute of each keyword in the path to acquire the target keyword association paths.
For example, the keyword association path of the keyword S may be represented as W1-S-W2-W3. The keyword association paths in which W1 and W3 are derived from the first language technical literature data set and the second language technical literature data set respectively are screened out as the target keyword association paths.
It can be understood that, besides the above keyword association path filtering rule, the language attribute of the keyword S and the second first-order co-occurrence word W2 may be limited according to actual requirements.
For each target keyword association path, a potential implicit associated word pair is then acquired according to the first-order co-occurrence word W1 and the first second-order co-occurrence word W3.
According to the technical scheme provided by this embodiment, potential implicit associated word pairs are acquired by combining the first-order and second-order co-occurrence relations of the keywords with the language attributes of the keywords. Only the words adjacent to a given keyword need to be considered, and first-order and second-order co-occurrence relations are easier to acquire than co-occurrence relations of higher orders, so the amount of calculation is effectively reduced and the method is applicable to large-scale data processing. Combining the first-order and second-order co-occurrence relations also captures the association between words better, improves the accuracy of acquiring implicit association relations between words, and effectively improves the reliability of the implicit associations acquired between pieces of knowledge.
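As an illustration of how keyword association paths of the form W1-S-W2-W3 could be enumerated from first-order co-occurrence word pairs, the following is a minimal Python sketch. The function names `build_neighbors` and `association_paths` and the dictionary layout of `cofreq` are assumptions made for this illustration and are not taken from the application.

```python
from collections import defaultdict


def build_neighbors(cofreq):
    """Map each word to the set of its first-order co-occurrence words.

    `cofreq` is assumed to map an unordered word pair, stored as a tuple,
    to its co-occurrence frequency.
    """
    neighbors = defaultdict(set)
    for (a, b), count in cofreq.items():
        if count > 0:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


def association_paths(cofreq):
    """Yield (W1, S, W2, W3) tuples.

    W1 and W2 are first-order co-occurrence words of S; W3 co-occurs with W2
    but not with S, so W3 is a second-order co-occurrence word of S.
    """
    neighbors = build_neighbors(cofreq)
    for s, first_order in neighbors.items():
        for w2 in first_order:
            for w3 in neighbors[w2]:
                if w3 == s or w3 in first_order:
                    continue  # W3 must not co-occur directly with S
                for w1 in first_order:
                    if w1 != w2 and w1 != w3:
                        yield (w1, s, w2, w3)


# Hypothetical toy data: S co-occurs with W1 and W2, and W2 co-occurs with W3.
example_cofreq = {("S", "W1"): 3, ("S", "W2"): 2, ("W2", "W3"): 4}
example_paths = set(association_paths(example_cofreq))
```

Because co-occurrence is symmetric, the toy data yields the path W1-S-W2-W3 and also its mirror image W3-W2-S-W1, in which W2 plays the role of the keyword. The sketch does not yet apply any language attribute screening.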
In an optional embodiment of the present application, obtaining a target keyword association path according to a language attribute of each keyword in each keyword association path specifically includes:
according to the language attribute of each keyword in each keyword association path, screening out the keyword association paths meeting the preset language attribute screening rule in the keyword association paths as target keyword association paths;
the preset language attribute screening rule comprises the following steps:
determining, according to the language attribute of the keyword, that the keyword is derived from both the first language technical literature data set and the second language technical literature data set;
determining, according to the language attribute of the first-order co-occurrence word and the language attribute of the first second-order co-occurrence word, that the first-order co-occurrence word and the first second-order co-occurrence word are derived only from the first language technical literature data set and only from the second language technical literature data set, respectively.
Specifically, according to the language attribute of each keyword in each keyword association path, screening each keyword association path by adopting a preset language attribute screening rule, and screening out the keyword association paths meeting the preset language attribute screening rule in the keyword association paths as target keyword association paths.
The following example is used to describe the preset language attribute screening rules.
Each keyword is labeled according to its language attribute, the label types being L1, L2 and BOTH. The label L1 indicates that the technical literature data set from which the keyword is derived is in the first language; the label L2 indicates that it is in the second language; and the label BOTH indicates that the keyword is derived from technical literature data sets in both the first language and the second language.
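The labeling just described can be sketched as a simple set-membership check. The function name `language_label` and the two input sets are hypothetical; the second set is assumed to hold second language keywords after translation into the first language, so that both sets are comparable.

```python
def language_label(keyword, l1_keywords, l2_translated_keywords):
    """Return "L1", "L2" or "BOTH" depending on which keyword sets contain `keyword`."""
    in_l1 = keyword in l1_keywords
    in_l2 = keyword in l2_translated_keywords
    if in_l1 and in_l2:
        return "BOTH"
    if in_l1:
        return "L1"
    if in_l2:
        return "L2"
    raise KeyError(f"{keyword!r} is not in either keyword set")


# Hypothetical toy keyword sets.
l1_set = {"battery", "graphene"}
l2_set = {"battery", "anode"}
labels = {w: language_label(w, l1_set, l2_set) for w in l1_set | l2_set}
```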
The keyword S, the first-order co-occurrence word W1 corresponding to the keyword S, the second first-order co-occurrence word W2 and the first second-order co-occurrence word W3 form a keyword association path which is expressed as W1-S-W2-W3.
In a target keyword association path meeting the preset language attribute screening rule, the keyword S must be labeled BOTH, and one of the first-order co-occurrence word W1 and the first second-order co-occurrence word W3 must be labeled L1 while the other is labeled L2.
Fig. 3 is a schematic diagram of the target keyword association path types provided in the embodiment of the present application. As shown in fig. 3, since the label of the second first-order co-occurrence word W2 is not limited, there are three types of target keyword association paths, in which W2 is labeled L2, L1 and BOTH, respectively.
It will be appreciated that a keyword a and its first-order co-occurrence word b form a first-order co-occurrence word pair (a, b), and the number of times the keyword a and the first-order co-occurrence word b co-occur is the co-occurrence frequency, denoted cofreq(a, b).
In type 1, type 2 and type 3, the co-occurrence frequency of each first-order co-occurrence word pair satisfies the following condition:
cofreq(S,W1)≠0
cofreq(S,W2)≠0
cofreq(S,W3)=0
cofreq(W1,W2)=0
cofreq(W2,W3)≠0
According to the technical scheme provided by this embodiment, keyword association paths are obtained by combining the first-order and second-order co-occurrence relations, the paths are screened according to the language attributes of the keywords they contain to obtain target keyword association paths, and potential implicit associated word pairs are then obtained. The target implicit associated word pairs are obtained through dual screening by language attribute and by association degree. In addition to limiting the language attributes of the two words in an implicit associated word pair, the scheme further limits the language attribute of the linking keyword that joins the first-order co-occurrence word pair and the second-order co-occurrence word pair in the association path. In this way, implicit associations between words in different languages can be better captured, deep knowledge associations not yet perceived across different languages can be acquired, the accuracy of acquiring implicit association relations between words is improved, and the reliability of the implicit associations acquired between pieces of knowledge is effectively improved.
In an optional embodiment of the present application, the method for obtaining the association degree between the words in a potential implicit associated word pair includes:
obtaining the first-order co-occurrence probability of the keyword and the first-order co-occurrence word, and obtaining the second-order co-occurrence probability of the keyword and the first second-order co-occurrence word;
and obtaining the association degree between the words in the potential implicit associated word pair according to the product of the first-order co-occurrence probability and the second-order co-occurrence probability.
Specifically, this embodiment obtains the association degree between the words in a potential implicit associated word pair based on co-occurrence probability. The first-order co-occurrence word pairs and the co-occurrence frequency of each pair are determined from the co-occurrence relations among the keywords, and the co-occurrence probability of each first-order co-occurrence word pair is calculated from the co-occurrence frequency.
The set of first-order co-occurrence words of the keyword S is denoted $B_S$ and can be expressed as:
$B_S = \{w_1, w_2, \ldots, w_N\}$
wherein N is the number of words in the set $B_S$, namely the number of first-order co-occurrence words of the keyword S, and $w_i$ ($1 \le i \le N$) is a specific first-order co-occurrence word of the keyword S.
The first-order co-occurrence probability of word a and word b is defined as CoProb(a, b).
The specific calculation formula of the CoProb (a, b) is as follows:
where freq (a) and freq (b) refer to the total number of occurrences of words a and b, respectively, in the entire dataset. cofreq (a, b) refers to the co-occurrence frequency of words a and b, i.e., the number of co-occurrences of words a and b in a dataset.
The second-order co-occurrence words of the keyword S are obtained based on its first-order co-occurrence word set. The set of second-order co-occurrence words of the keyword S is denoted $B_S^{(2)}$ and can be expressed as:
$B_S^{(2)} = \{w_1^{(2)}, w_2^{(2)}, \ldots, w_M^{(2)}\}$
wherein M is the number of second-order co-occurrence words of the keyword S, and $w_j^{(2)}$ ($1 \le j \le M$) is a specific second-order co-occurrence word of the keyword S.
The second-order co-occurrence probability of word a and word c is defined as CoProb2(a, c), where word a and word c have no direct co-occurrence relationship.
The specific calculation formula of CoProb2(a, c) is:
wherein $B_a$ is the set of first-order co-occurrence words of word a, and $B_c$ is the set of first-order co-occurrence words of word c.
The keyword S, the first-order co-occurrence word W1 corresponding to the keyword S, the second first-order co-occurrence word W2 and the first second-order co-occurrence word W3 form a keyword association path, expressed as W1-S-W2-W3. The potential implicit associated word pair W1-W3 can be determined from this keyword association path.
According to the first-order co-occurrence probability calculation formula, the first-order co-occurrence probability CoProb(S, W1) of the keyword S and the first-order co-occurrence word W1, the first-order co-occurrence probability CoProb(S, W2) of the keyword S and the second first-order co-occurrence word W2, and the first-order co-occurrence probability CoProb(W2, W3) of the second first-order co-occurrence word W2 and the first second-order co-occurrence word W3 can be obtained.
Substituting the first-order co-occurrence probabilities CoProb(S, W2) and CoProb(W2, W3) into the second-order co-occurrence probability calculation formula gives the second-order co-occurrence probability CoProb2(S, W3) of the keyword S and the first second-order co-occurrence word W3.
The association degree between the words in the potential implicit associated word pair W1-W3 is defined as RelScore(W1, W3).
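The specific calculation formula of RelScore(W1, W3) is:
$\mathrm{RelScore}(W1, W3) = \mathrm{CoProb}(S, W1) \times \mathrm{CoProb2}(S, W3)$
wherein $B_S$ is the set of first-order co-occurrence words of the keyword S and $B_{W3}$ is the set of first-order co-occurrence words of the first second-order co-occurrence word W3; both sets enter the formula through the second-order co-occurrence probability CoProb2(S, W3).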
Substituting the first-order co-occurrence probability CoProb(S, W1) and the second-order co-occurrence probability CoProb2(S, W3) into the association degree calculation formula gives the association degree RelScore(W1, W3) between the words in the potential implicit associated word pair W1-W3.
According to the technical scheme provided by this embodiment, the association degree between the words of the potential implicit associated word pair in the keyword association path W1-S-W2-W3 is defined as the product of the first-order co-occurrence probability of the first-order co-occurrence word pair S-W1 and the second-order co-occurrence probability of the second-order co-occurrence word pair S-W3. This effectively measures the association strength between the words in a potential implicit associated word pair. Screening the target implicit associated word pairs based on the association degree ensures that strongly associated implicit associations are retained, and effectively improves the reliability of the implicit associations acquired between pieces of knowledge.
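The product that defines the association degree can be sketched in Python as follows. Two points are assumptions and not taken from the application: the first-order co-occurrence probabilities are supplied as a precomputed dictionary `coprob`, and the second-order co-occurrence probability is taken to be the sum, over the intermediate words shared by the two first-order co-occurrence word sets, of the products of the two first-order probabilities. The names `pair_key`, `coprob2` and `rel_score` are hypothetical.

```python
def pair_key(a, b):
    """Order-independent key for a word pair."""
    return frozenset((a, b))


def coprob2(a, c, coprob, neighbors):
    """Second-order co-occurrence probability of a and c (assumed form).

    Sums CoProb(a, b) * CoProb(b, c) over every intermediate word b that is a
    first-order co-occurrence word of both a and c.
    """
    shared = neighbors[a] & neighbors[c]
    return sum(coprob[pair_key(a, b)] * coprob[pair_key(b, c)] for b in shared)


def rel_score(w1, s, w3, coprob, neighbors):
    """Association degree of (W1, W3): CoProb(S, W1) times CoProb2(S, W3)."""
    return coprob[pair_key(s, w1)] * coprob2(s, w3, coprob, neighbors)


# Hypothetical toy probabilities for the path W1-S-W2-W3.
toy_coprob = {
    pair_key("S", "W1"): 0.5,
    pair_key("S", "W2"): 0.4,
    pair_key("W2", "W3"): 0.25,
}
toy_neighbors = {"S": {"W1", "W2"}, "W1": {"S"}, "W2": {"S", "W3"}, "W3": {"W2"}}
toy_score = rel_score("W1", "S", "W3", toy_coprob, toy_neighbors)
```

With a single intermediate word W2, the assumed second-order probability reduces to CoProb(S, W2) multiplied by CoProb(W2, W3), which matches the substitution step described above.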
The following describes a specific application of the embodiment of the present application in detail by a specific example:
Fig. 4 is a flowchart of a cross-language implicit associated knowledge discovery method according to an embodiment of the present application. As shown in fig. 4, the specific steps of the method are illustrated by taking Chinese as the first language, English as the second language, and patents as the technical literature.
Keywords are extracted from the title and abstract of each text in the Chinese patent data set and the English patent data set, respectively, to obtain a Chinese keyword set and an English keyword set. Each English keyword in the English keyword set is translated and disambiguated to obtain its unique corresponding translation keyword, yielding the target translation keyword set. The language attribute of each keyword is then labeled.
Because the data processed in this embodiment is bilingual patent data covering all fields, the data volume is huge (on the order of tens of millions, with text volume reaching the TB level), and conventional memory devices cannot hold such a large amount of data for storage and calculation.
Fig. 5 is a schematic diagram of the data processing flow provided in an embodiment of the present application. As shown in fig. 5, a portion of memory is used as the computing platform, and data is continuously exchanged with a database during computation to extend the capacity of the computing platform; during data transmission, multiple processes work together with multiple database files to increase the reading speed of the database.
When the co-occurrence relations among keywords are obtained, the patent data in the Chinese patent data set and the English patent data set is split into a plurality of patent data sets (for example, 2000 patent data sets), which are loaded into memory in sequence. For each patent, all keywords in the patent are identified by matching against the keyword set (the Chinese keyword set or the English keyword set) to form a keyword table; a text window of a specified length is constructed; the keyword pairs co-occurring within the text window are counted and their co-occurrence frequencies are recorded to build a first-order co-occurrence word pair list; and the co-occurrence relations are registered in the co-occurrence matrix according to the first-order co-occurrence word pair list.
It is understood that co-occurrence matrix is a two-dimensional matrix used to represent co-occurrence relationships between words. In the co-occurrence matrix, the rows and columns represent different words, respectively, while each element in the matrix represents the co-occurrence frequency of the two words.
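Window-based co-occurrence counting of the kind described above can be sketched as follows. The function name `count_cooccurrences`, the default window length of 5 tokens, and the choice to count each pair of token positions once are assumptions for this illustration; the application does not specify these details. A `Counter` keyed by sorted word pairs stands in for the sparse co-occurrence matrix.

```python
from collections import Counter


def count_cooccurrences(tokens, keyword_set, window=5):
    """Count unordered keyword pairs whose positions fit in one window.

    Two keywords co-occur when they are fewer than `window` tokens apart.
    The result is a sparse representation of the co-occurrence matrix.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        if left not in keyword_set:
            continue
        # Look only to the right so each pair of positions is counted once.
        for right in tokens[i + 1 : i + window]:
            if right in keyword_set and right != left:
                counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical toy text and keyword set.
toy_tokens = ["graphene", "is", "used", "in", "battery", "electrode"]
toy_keywords = {"graphene", "battery", "electrode"}
toy_counts = count_cooccurrences(toy_tokens, toy_keywords, window=5)
```

In the toy text, "graphene" and "electrode" are five positions apart and therefore fall outside a five-token window, so only the other two pairs are counted.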
Because the co-occurrence matrix is a large-scale sparse matrix whose storage size grows with the square of the number of keywords, once every patent in a patent data set has been processed as described above, the patent data set is discarded from memory and the next patent data set is loaded, and the in-memory co-occurrence matrix obtained is transferred into a database using the lightweight database SQLite. Compared with the large database MySQL, SQLite has a great advantage in reading speed.
Transferring the in-memory co-occurrence matrix into the database yields the corresponding database co-occurrence matrix. The database splits the co-occurrence information in the co-occurrence matrix by word and stores it as the co-occurrence information of the individual words.
Further, in order to increase the speed of reading data from and storing data to the database, this embodiment uses four sub-databases, each storing a part of the co-occurrence information of each word. Taking the co-occurrence information of word 1 as an example, the co-occurrence information list of word 1 is split into four co-occurrence information sub-lists, which are stored in the four sub-databases respectively. During reading, the four co-occurrence information sub-lists are read simultaneously and combined to obtain the co-occurrence information of word 1. With this method, the read-write speed can be increased to 6 w/s.
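The splitting of one word's co-occurrence information across four sub-databases can be sketched with Python's standard `sqlite3` module. The table name `cooc`, its columns, the round-robin split, and the helper names `open_shards`, `store` and `load` are all assumptions for this illustration. For simplicity the sketch reads the sub-databases one after another, whereas the embodiment reads them simultaneously with multiple processes.

```python
import sqlite3

NUM_SHARDS = 4  # the embodiment uses four sub-databases


def open_shards(paths):
    """Open one SQLite connection per sub-database and create the table."""
    shards = []
    for path in paths:
        conn = sqlite3.connect(path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS cooc ("
            "word TEXT, other TEXT, freq INTEGER, PRIMARY KEY (word, other))"
        )
        shards.append(conn)
    return shards


def store(shards, word, cooc_list):
    """Split the (other_word, freq) list round-robin and write one sub-list per shard."""
    for i, conn in enumerate(shards):
        sub_list = cooc_list[i :: len(shards)]
        conn.executemany(
            "INSERT OR REPLACE INTO cooc VALUES (?, ?, ?)",
            [(word, other, freq) for other, freq in sub_list],
        )
        conn.commit()


def load(shards, word):
    """Read the sub-lists from every shard and merge them into one dictionary."""
    merged = {}
    for conn in shards:
        rows = conn.execute("SELECT other, freq FROM cooc WHERE word = ?", (word,))
        for other, freq in rows:
            merged[other] = freq
    return merged


# In-memory databases keep the example self-contained; file paths work the same way.
toy_shards = open_shards([":memory:"] * NUM_SHARDS)
store(toy_shards, "word1", [("a", 3), ("b", 1), ("c", 7), ("d", 2), ("e", 5)])
toy_merged = load(toy_shards, "word1")
```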
After all the patent data are processed, a co-occurrence matrix recording the first-order co-occurrence relation among all the keywords is obtained, and the first-order co-occurrence probability of each first-order co-occurrence word pair is calculated. And obtaining all second-order co-occurrence word pairs, and calculating the second-order co-occurrence probability of each second-order co-occurrence word pair.
Potential implicit associated word pairs are obtained according to the first-order co-occurrence word pairs, the second-order co-occurrence word pairs and the language attribute of each keyword, and the association degree between the words in each potential implicit associated word pair is calculated.
The potential implicit associated word pairs are screened according to the association degree, and those whose association degree ranks in the top 10% are selected as target implicit associated word pairs.
The cross-language implicit associated knowledge discovery method provided by this embodiment meets the knowledge discovery requirements of large-scale, all-field, bilingual data sets and can be applied to knowledge discovery services in any subject field. Using information in public patent documents, it can track knowledge transfer across technical fields and languages and discover deep knowledge associations that have not yet been perceived, helping researchers look beyond their own research fields, disciplines and languages, promoting the fusion and sharing of cross-language knowledge, and effectively supporting cross innovation between different fields and different languages.
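Selecting the pairs whose association degree ranks in the top 10% can be sketched as a sort followed by a cut. The function name `select_top_percent` and the decision to round the cut-off up to a whole pair are assumptions for this illustration.

```python
def select_top_percent(scored_pairs, percent=10):
    """Return the (pair, score) items ranked in the top `percent` by score.

    `scored_pairs` maps a word pair to its association degree. The number of
    pairs kept is rounded up, using integer arithmetic to avoid float error.
    """
    ranked = sorted(scored_pairs.items(), key=lambda item: item[1], reverse=True)
    keep = -(-len(ranked) * percent // 100)  # ceiling division
    return ranked[:keep]


# Hypothetical toy scores for 20 candidate pairs.
toy_scores = {("w%d" % i, "v%d" % i): i / 100 for i in range(20)}
toy_top = select_top_percent(toy_scores)
```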
Fig. 6 is a schematic structural diagram of a cross-language implicit associated knowledge discovery apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus 60 may include: a keyword set acquisition module 601, a keyword set translation module 602, an associated word pair acquisition module 603 and an associated word pair screening module 604;
the keyword set obtaining module 601 is configured to perform keyword extraction on the first language technical literature data set to obtain a first language keyword set, and perform keyword extraction on the second language technical literature data set to obtain a second language keyword set;
The keyword set translation module 602 is configured to perform a first language translation and word sense disambiguation on each second language keyword in the second language keyword set to obtain a target translated keyword set;
the associated word pair acquisition module 603 is configured to obtain at least one potential implicit associated word pair, and the association degree between the words in each potential implicit associated word pair, according to the co-occurrence relations among the keywords in the first language keyword set and the target translation keyword set and the language attribute of each keyword; wherein the two keywords in an implicit associated word pair are derived only from the first language technical literature data set and only from the second language technical literature data set, respectively; and the language attribute represents the language type of the technical literature data set from which a keyword is derived;
and the associated word pair screening module 604 is configured to screen the potential implicit associated word pairs according to their association degrees, and obtain the target implicit associated word pairs.
According to the technical scheme provided by the embodiment, through the first language translation and word sense disambiguation processing of the second language keyword set, the second language keyword is converted into the first language to obtain the target translation keyword set, each target translation keyword in the target translation keyword set can be ensured to be the only first language translation corresponding to the second language keyword, and the problems of ambiguity and one-word multiple translation of keywords among different languages are solved. The co-occurrence relation among the keywords and the language attribute of the keywords are combined, and the target implicit association word pairs are obtained in a mode of language attribute screening and association degree dual screening, so that deep knowledge association which is not yet perceived in different languages can be obtained, word pairs with implicit association in different languages can be extracted more accurately, and the reliability of the implicit association among the obtained knowledge is improved effectively.
The device of the embodiment of the present application may perform the method provided by the embodiment of the present application, and its implementation principle is similar, and actions performed by each module in the device of the embodiment of the present application correspond to steps in the method of the embodiment of the present application, and detailed functional descriptions of each module of the device may be referred to the descriptions in the corresponding methods shown in the foregoing, which are not repeated herein.
In an alternative embodiment of the present application, the keyword set translation module is specifically configured to:
translating each second language keyword in the second language keyword set into the first language according to the bilingual scientific dictionary corresponding to the first language and the second language, and obtaining an initial translation keyword set; wherein each second language keyword corresponds to at least one translation keyword in the initial translation keyword set;
filtering translation keywords which are not in the first language keyword set in the initial translation keyword set to obtain an intermediate translation keyword set;
and screening out unique translation keywords corresponding to each second language keyword in the intermediate translation keyword set to obtain target translation keywords.
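The first two steps above, dictionary translation followed by filtering against the first language keyword set, can be sketched as follows. The function name `candidate_translations`, the dictionary layout and the romanized toy entries are hypothetical and chosen only for illustration.

```python
def candidate_translations(l2_keywords, bilingual_dict, l1_keywords):
    """Build the intermediate translation keyword set.

    Each second language keyword is looked up in the bilingual dictionary
    (initial translations), and only translations that already occur in the
    first language keyword set are kept.
    """
    intermediate = {}
    for keyword in l2_keywords:
        initial = bilingual_dict.get(keyword, [])
        kept = [t for t in initial if t in l1_keywords]
        if kept:
            intermediate[keyword] = kept
    return intermediate


# Hypothetical toy dictionary: "cell" has three candidate translations,
# written here in romanized form (battery, biological cell, prison cell).
toy_dict = {"cell": ["dianchi", "xibao", "laofang"], "anode": ["yangji"]}
toy_l1_keywords = {"dianchi", "xibao", "yangji"}
toy_intermediate = candidate_translations(["cell", "anode", "qubit"], toy_dict, toy_l1_keywords)
```

A keyword with more than one remaining candidate, such as "cell" in the toy data, still needs the disambiguation step to select its unique translation.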
In an alternative embodiment of the present application, the keyword set translation module is specifically configured to:
For each second language keyword in the second language keyword set, acquiring a context word vector corresponding to the second language keyword according to the text containing the second language keyword in the second language scientific literature data set;
acquiring the word vector of each translation keyword corresponding to the second language keyword in the intermediate translation keyword set;
according to the word vector and the context word vector of each translation keyword, obtaining the similarity between the word vector of each translation keyword and the context word vector corresponding to the second language keyword;
and screening out the translation keywords corresponding to the maximum similarity according to the similarities as unique translation keywords.
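The selection of the translation keyword with the maximum similarity can be sketched as below. The application does not name a similarity measure; cosine similarity is used here as an assumption, and the function names `cosine` and `pick_translation` as well as the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pick_translation(context_vector, candidate_vectors):
    """Return the candidate whose word vector is most similar to the context vector."""
    return max(
        candidate_vectors,
        key=lambda t: cosine(context_vector, candidate_vectors[t]),
    )


# Hypothetical toy vectors: the context points towards the "battery" sense.
toy_context = [1.0, 0.0]
toy_candidates = {"dianchi": [0.9, 0.1], "xibao": [0.1, 0.9]}
toy_choice = pick_translation(toy_context, toy_candidates)
```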
In an alternative embodiment of the application, the co-occurrence relationship comprises: a first order co-occurrence relationship and a second order co-occurrence relationship;
the related word pair acquisition module is specifically used for:
obtaining first-order co-occurrence word pairs according to first-order co-occurrence relations among keywords in the first language keyword set and the target translation keyword set;
obtaining a second-order co-occurrence word pair according to the second-order co-occurrence relation between each keyword in the first language keyword set and the target translation keyword set;
and obtaining potential implicit associated word pairs according to the first-order co-occurrence word pairs, the second-order co-occurrence word pairs and the language attribute of each keyword in each word pair.
In an alternative embodiment of the present application, the related word pair obtaining module is specifically configured to:
acquiring a keyword association path corresponding to each keyword according to the first-order co-occurrence word pair and the second-order co-occurrence word pair; the keyword association path comprises keywords, first-order co-occurrence words corresponding to the keywords, second first-order co-occurrence words and first second-order co-occurrence words; the second first-order co-occurrence word and the first second-order co-occurrence word form a first-order co-occurrence word pair;
acquiring a target keyword association path according to the language attribute of each keyword in each keyword association path;
and for each target keyword association path, acquiring a potential implicit associated word pair according to the first-order co-occurrence word and the first second-order co-occurrence word.
In an alternative embodiment of the present application, the related word pair obtaining module is specifically configured to:
according to the language attribute of each keyword in each keyword association path, screening out the keyword association paths meeting the preset language attribute screening rule in the keyword association paths as target keyword association paths;
the preset language attribute screening rule comprises the following steps:
determining, according to the language attribute of the keyword, that the keyword is derived from both the first language technical literature data set and the second language technical literature data set;
determining, according to the language attribute of the first-order co-occurrence word and the language attribute of the first second-order co-occurrence word, that the first-order co-occurrence word and the first second-order co-occurrence word are derived only from the first language technical literature data set and only from the second language technical literature data set, respectively.
In an alternative embodiment of the present application, the related word pair obtaining module is specifically configured to:
obtaining first-order co-occurrence probability of the keyword and the first-order co-occurrence word, and obtaining second-order co-occurrence probability of the keyword and the first second-order co-occurrence word;
and obtaining the association degree between the words in the potential implicit associated word pair according to the product of the first-order co-occurrence probability and the second-order co-occurrence probability.
The embodiment of the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the cross-language implicit associated knowledge discovery method. Compared with the related art, the following can be achieved: through first language translation and word sense disambiguation of the second language keyword set, the second language keywords are converted into the first language to obtain the target translation keyword set, so that each target translation keyword in the target translation keyword set is the unique first language translation of its second language keyword, which solves the problems of keyword ambiguity and one-word-multiple-translations between different languages. By combining the co-occurrence relations among the keywords with the language attributes of the keywords, the target implicit associated word pairs are obtained through dual screening by language attribute and by association degree, so that deep knowledge associations not yet perceived across different languages can be obtained, word pairs with implicit associations across different languages can be extracted more accurately, and the reliability of the implicit associations acquired between pieces of knowledge is effectively improved.
In an alternative embodiment, an electronic device is provided. As shown in fig. 7, the electronic device 70 comprises a processor 701 and a memory 703. The processor 701 is coupled to the memory 703, for example via a bus 702. Optionally, the electronic device 70 may further comprise a transceiver 704, which may be used for data interaction between the electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the transceiver 704 is not limited to one, and the structure of the electronic device 70 does not constitute a limitation on the embodiment of the present application.
The processor 701 may be a CPU (Central Processing Unit), a general purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 701 may also be a combination that performs computing functions, for example a combination including one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 702 may include a path for transferring information between the above components. Bus 702 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Bus 702 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or one type of bus.
The memory 703 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer, without limitation.
The memory 703 is used for storing a computer program for executing an embodiment of the present application and is controlled to be executed by the processor 701. The processor 701 is arranged to execute a computer program stored in the memory 703 for carrying out the steps shown in the foregoing method embodiments.
The electronic device in the embodiment of the present application may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a car-mounted terminal (e.g., car navigation terminal), a wearable device, etc., and a fixed terminal such as a digital TV, a desktop computer, etc.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although various operation steps are indicated by arrows in the flowcharts of the embodiments of the present application, the order in which these steps are implemented is not limited to the order indicated by the arrows. In some implementations of embodiments of the application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages can be flexibly configured according to the requirement, which is not limited by the embodiment of the present application.
The foregoing describes only optional implementations of some implementation scenarios of the present application. It should be noted that other similar implementations adopted by those skilled in the art on the basis of the technical ideas of the present application, without departing from those technical ideas, also fall within the protection scope of the embodiments of the present application.

Claims (7)

1. A cross-language implicit associated knowledge discovery method, characterized by comprising the following steps:
extracting keywords from the first language technical literature data set to obtain a first language keyword set, and extracting keywords from the second language technical literature data set to obtain a second language keyword set;
performing first language translation and word sense disambiguation on each second language keyword in the second language keyword set to obtain a target translation keyword set;
obtaining at least one potential hidden associated word pair and the association degree of the words in each potential hidden associated word pair according to the co-occurrence relation between the keywords in the first language keyword set and the target translation keyword set and the language attribute of each keyword; wherein the two keywords in each potential hidden associated word pair are derived only from the first language technical literature data set and only from the second language technical literature data set, respectively; the language attribute characterizes the language type of the technical literature data set from which the keyword is derived;
screening each potential hidden associated word pair according to each association degree to obtain a target hidden associated word pair;
the co-occurrence relationship includes: a first order co-occurrence relationship and a second order co-occurrence relationship;
the method for acquiring the potential hidden associated word pairs comprises the following steps:
acquiring a first-order co-occurrence word pair according to the first-order co-occurrence relation between each keyword in the first language keyword set and the target translation keyword set;
obtaining a second-order co-occurrence word pair according to the second-order co-occurrence relation between each keyword in the first language keyword set and the target translation keyword set;
acquiring a potential hidden associated word pair according to the first-order co-occurrence word pair, the second-order co-occurrence word pair and the language attribute of each keyword in each word pair;
the step of obtaining a potential hidden associated word pair according to the first-order co-occurrence word pair, the second-order co-occurrence word pair and the language attribute of each keyword in each word pair specifically comprises the following steps:
acquiring a keyword association path corresponding to each keyword according to the first-order co-occurrence word pair and the second-order co-occurrence word pair; the keyword association path comprises the keywords, first-order co-occurrence words corresponding to the keywords, second first-order co-occurrence words and first second-order co-occurrence words; the second first-order co-occurrence word and the first second-order co-occurrence word form the first-order co-occurrence word pair;
acquiring a target keyword association path according to the language attribute of each keyword in each keyword association path;
for each target keyword association path, acquiring the potential hidden associated word pair according to the first-order co-occurrence word and the first second-order co-occurrence word;
the method for acquiring the association degree of the words in the potential hidden associated word pair comprises the following steps:
obtaining first-order co-occurrence probability of the keyword and the first-order co-occurrence word, and obtaining second-order co-occurrence probability of the keyword and the first second-order co-occurrence word;
and obtaining the association degree of the words in the potential hidden associated word pair according to the product of the first-order co-occurrence probability and the second-order co-occurrence probability.
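By way of non-limiting illustration, the association degree of claim 1 can be sketched as follows. The claim states only that the association degree is the product of a first-order and a second-order co-occurrence probability; how each probability is estimated is an assumption made here for illustration. In this sketch a document is modelled as a list of keywords, the first-order probability is estimated as the share of documents containing the keyword that also contain the other word, and the second-order probability is taken as the strongest chain through a bridging word. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(docs):
    """Return (keyword document counts, unordered keyword-pair document counts)."""
    single, pair = Counter(), Counter()
    for doc in docs:
        kws = sorted(set(doc))
        single.update(kws)
        # kws is sorted, so every pair tuple is stored in sorted order
        pair.update(combinations(kws, 2))
    return single, pair


def first_order_prob(single, pair, k, w):
    """Assumed estimate: share of the documents containing k that also contain w."""
    key = tuple(sorted((k, w)))
    return pair[key] / single[k] if single[k] else 0.0


def second_order_prob(single, pair, k, c):
    """Assumed estimate: strongest chain k -> b -> c over bridging words b."""
    best = 0.0
    for b in single:
        if b in (k, c):
            continue
        chain = first_order_prob(single, pair, k, b) * first_order_prob(single, pair, b, c)
        best = max(best, chain)
    return best


def association_degree(single, pair, k, a, c):
    """Product of the first-order probability (k, a) and the second-order probability (k, c)."""
    return first_order_prob(single, pair, k, a) * second_order_prob(single, pair, k, c)


# Toy data (hypothetical): "graphene" bridges "battery" and "silicon" via "anode".
docs = [["graphene", "battery"], ["graphene", "anode"], ["anode", "silicon"]]
single, pair = count_cooccurrences(docs)
degree = association_degree(single, pair, "graphene", "battery", "silicon")
# degree == 0.5 * (0.5 * 0.5) == 0.125
```

Screening by association degree then amounts to keeping the pairs whose degree exceeds a chosen threshold or ranks among the highest; the claim does not fix which criterion is used.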
2. The method for finding cross-language implicit associated knowledge according to claim 1, wherein performing first language translation and word sense disambiguation on each second language keyword in the second language keyword set to obtain a target translation keyword set specifically comprises:
translating each second language keyword in the second language keyword set into the first language according to a bilingual scientific dictionary corresponding to the first language and the second language, to obtain an initial translation keyword set; wherein each second language keyword corresponds to at least one translation keyword in the initial translation keyword set;
filtering out, from the initial translation keyword set, the translation keywords that are not in the first language keyword set, to obtain an intermediate translation keyword set;
and screening out the unique translation keyword corresponding to each second language keyword in the intermediate translation keyword set, to obtain the target translation keyword set.
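By way of non-limiting illustration, the dictionary translation and filtering steps of claim 2 can be sketched as follows. The bilingual scientific dictionary is modelled as a plain mapping from a second-language keyword to a list of first-language candidates; this data structure, the function names and the German/English toy entries are assumptions for illustration only. Keywords with no surviving candidate are dropped in this sketch, which the claim does not specify.

```python
def initial_translations(second_lang_keywords, bilingual_dict):
    """Look up every second-language keyword; missing entries yield an empty list."""
    return {kw: list(bilingual_dict.get(kw, [])) for kw in second_lang_keywords}


def filter_by_first_language(candidates, first_lang_keywords):
    """Keep only the candidates that also occur in the first-language keyword set."""
    kept = {}
    for kw, cands in candidates.items():
        in_set = [c for c in cands if c in first_lang_keywords]
        if in_set:  # assumption: keywords left without candidates are dropped
            kept[kw] = in_set
    return kept


# Hypothetical toy data: English as first language, German as second language.
bilingual = {"Leiter": ["conductor", "ladder", "leader"], "Anode": ["anode"]}
first_lang = {"conductor", "ladder", "anode", "graphene"}

initial = initial_translations(["Leiter", "Anode", "Fremdwort"], bilingual)
intermediate = filter_by_first_language(initial, first_lang)
```

In this toy run "Leiter" keeps two candidates ("conductor" and "ladder"), so a further disambiguation step is still needed to reach a unique translation keyword.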
3. The method for finding cross-language implicit associated knowledge according to claim 2, wherein said screening out unique translation keywords corresponding to each of said second language keywords in said set of intermediate translation keywords comprises:
for each second language keyword in the second language keyword set, acquiring a context word vector corresponding to the second language keyword according to the text containing the second language keyword in the second language technical literature data set;
acquiring the word vector of each translation keyword corresponding to the second language keyword in the intermediate translation keyword set;
according to the word vector of each translation keyword and the context word vector, obtaining the similarity between the word vector of each translation keyword and the context word vector corresponding to the second language keyword;
and screening out, according to the similarities, the translation keyword with the maximum similarity as the unique translation keyword.
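By way of non-limiting illustration, the disambiguation step of claim 3 can be sketched as follows. The claim does not name a similarity measure or say how the context word vector is formed; cosine similarity and an element-wise mean of the context word vectors are assumed here, and the vectors are assumed to lie in one shared embedding space. The toy two-dimensional vectors and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Element-wise mean, used here as an assumed form of the context word vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pick_translation(candidate_vectors, context_vector):
    """Return the candidate whose word vector is most similar to the context vector."""
    return max(candidate_vectors,
               key=lambda cand: cosine(candidate_vectors[cand], context_vector))


# Hypothetical vectors for the two candidates of one second-language keyword.
candidate_vectors = {"conductor": [1.0, 0.0], "ladder": [0.0, 1.0]}
# Hypothetical vectors of words surrounding the keyword in the source text.
context = mean_vector([[0.9, 0.1], [0.7, 0.3]])
chosen = pick_translation(candidate_vectors, context)
```

Here the context lies much closer to "conductor" than to "ladder", so "conductor" is selected as the unique translation keyword.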
4. The method for finding cross-language implicit associated knowledge according to any one of claims 1 to 3, wherein the obtaining a target keyword association path according to the language attribute of each keyword in each keyword association path specifically includes:
according to the language attribute of each keyword in each keyword association path, screening out the keyword association paths meeting a preset language attribute screening rule in the keyword association paths as the target keyword association paths;
wherein, the preset language attribute screening rule comprises:
determining, according to the language attribute of the keyword, that the keyword is derived from both the first language technical literature data set and the second language technical literature data set;
and determining, according to the language attribute of the first-order co-occurrence word and the language attribute of the first second-order co-occurrence word, that the first-order co-occurrence word and the first second-order co-occurrence word are derived only from the first language technical literature data set and only from the second language technical literature data set, respectively.
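By way of non-limiting illustration, the language attribute screening rule of claim 4 can be sketched as follows. A keyword association path is modelled as a 4-tuple (keyword, first-order co-occurrence word, second first-order co-occurrence word, first second-order co-occurrence word), and the language attribute as a mapping from each keyword to the set of data sets it was derived from. The tuple layout, the labels "L1" and "L2" and the function names are assumptions for illustration only.

```python
FIRST, SECOND = "L1", "L2"  # hypothetical labels for the two literature data sets


def is_target_path(path, attrs):
    """Apply the screening rule to one keyword association path."""
    keyword, first_order_word, _bridge, second_order_word = path
    return (attrs.get(keyword) == {FIRST, SECOND}          # keyword occurs in both data sets
            and attrs.get(first_order_word) == {FIRST}     # only in the first-language data set
            and attrs.get(second_order_word) == {SECOND})  # only in the second-language data set


def hidden_pairs(paths, attrs):
    """Return (first-order word, second-order word) for every path passing the rule."""
    return [(p[1], p[3]) for p in paths if is_target_path(p, attrs)]


# Hypothetical language attributes and candidate paths.
attrs = {
    "graphene": {FIRST, SECOND},
    "anode": {FIRST, SECOND},
    "battery": {FIRST},
    "copper": {FIRST},
    "silicon": {SECOND},
}
paths = [
    ("graphene", "battery", "anode", "silicon"),  # passes the rule
    ("graphene", "battery", "anode", "copper"),   # fails: "copper" is not second-language only
    ("battery", "graphene", "anode", "silicon"),  # fails: "battery" is not in both data sets
]
pairs = hidden_pairs(paths, attrs)
```

Only the first path survives, giving the potential hidden associated word pair ("battery", "silicon"), whose two words come from different data sets and are linked through the shared keyword "graphene".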
5. A cross-language implicit associated knowledge discovery apparatus, comprising:
the keyword set acquisition module is used for extracting keywords from the first language technical literature data set to obtain a first language keyword set, and extracting keywords from the second language technical literature data set to obtain a second language keyword set;
the keyword set translation module is used for carrying out first language translation and word sense disambiguation on each second language keyword in the second language keyword set to obtain a target translation keyword set;
the associated word pair acquisition module is used for acquiring at least one potential hidden associated word pair and the association degree of the words in each potential hidden associated word pair according to the co-occurrence relation between the keywords in the first language keyword set and the target translation keyword set and the language attribute of each keyword; wherein the two keywords in each potential hidden associated word pair are derived only from the first language technical literature data set and only from the second language technical literature data set, respectively; the language attribute characterizes the language type of the technical literature data set from which the keyword is derived;
the associated word pair screening module is used for screening each potential hidden associated word pair according to each association degree to obtain a target hidden associated word pair;
the co-occurrence relationship includes: a first order co-occurrence relationship and a second order co-occurrence relationship;
the associated word pair acquisition module is specifically configured to:
acquiring a first-order co-occurrence word pair according to the first-order co-occurrence relation between each keyword in the first language keyword set and the target translation keyword set;
obtaining a second-order co-occurrence word pair according to the second-order co-occurrence relation between each keyword in the first language keyword set and the target translation keyword set;
acquiring a potential hidden associated word pair according to the first-order co-occurrence word pair, the second-order co-occurrence word pair and the language attribute of each keyword in each word pair;
the associated word pair acquisition module is specifically configured to:
acquiring a keyword association path corresponding to each keyword according to the first-order co-occurrence word pair and the second-order co-occurrence word pair; the keyword association path comprises the keywords, first-order co-occurrence words corresponding to the keywords, second first-order co-occurrence words and first second-order co-occurrence words; the second first-order co-occurrence word and the first second-order co-occurrence word form the first-order co-occurrence word pair;
acquiring a target keyword association path according to the language attribute of each keyword in each keyword association path;
for each target keyword association path, acquiring the potential hidden associated word pair according to the first-order co-occurrence word and the first second-order co-occurrence word;
the associated word pair acquisition module is specifically configured to:
obtaining first-order co-occurrence probability of the keyword and the first-order co-occurrence word, and obtaining second-order co-occurrence probability of the keyword and the first second-order co-occurrence word;
and obtaining the association degree of the words in the potential hidden associated word pair according to the product of the first-order co-occurrence probability and the second-order co-occurrence probability.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-4.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1-4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310873311.8A CN116595124B (en) 2023-07-17 2023-07-17 Cross-language implicit associated knowledge discovery method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116595124A CN116595124A (en) 2023-08-15
CN116595124B true CN116595124B (en) 2023-10-10

Family

ID=87601299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310873311.8A Active CN116595124B (en) 2023-07-17 2023-07-17 Cross-language implicit associated knowledge discovery method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116595124B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680509A (en) * 2020-06-10 2020-09-18 四川九洲电器集团有限责任公司 Method and device for automatically extracting text keywords based on co-occurrence language network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11983210B2 (en) * 2020-06-16 2024-05-14 Virginia Tech Intellectual Properties, Inc. Methods and systems for generating summaries given documents with questions and answers

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
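A scientific knowledge map analysis of international metonymy research trends (2007-2016); 金胜昔 et al.; 《外语研究》 (No. 3); full text *
Research on implicit collaboration relationships among researchers based on subject content; 牟冬梅 et al.; 《情报理论与实践》; Vol. 40 (No. 7); full text *
Evolution analysis of domestic tacit knowledge management research topics based on citations and the Venn diagram method; 邰杨芳 et al.; 《数字图书馆论坛》 (No. 3); full text *
Analysis and mining of implicit temporal information in literature for scientific research topics; 沈思 et al.; 《情报学报》; Vol. 36 (No. 4); full text *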

Also Published As

Publication number Publication date
CN116595124A (en) 2023-08-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant