CN113065343B - Enterprise research and development resource information modeling method based on semantics - Google Patents

Enterprise research and development resource information modeling method based on semantics

Info

Publication number
CN113065343B
CN113065343B CN202110318900.0A CN202110318900A CN113065343B CN 113065343 B CN113065343 B CN 113065343B CN 202110318900 A CN202110318900 A CN 202110318900A CN 113065343 B CN113065343 B CN 113065343B
Authority
CN
China
Prior art keywords
resource information
development
enterprise
enterprise research
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110318900.0A
Other languages
Chinese (zh)
Other versions
CN113065343A (en
Inventor
王磊
马剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110318900.0A priority Critical patent/CN113065343B/en
Publication of CN113065343A publication Critical patent/CN113065343A/en
Application granted granted Critical
Publication of CN113065343B publication Critical patent/CN113065343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a semantic-based enterprise research and development resource information modeling method, which comprises the following steps: (1) constructing an enterprise research and development resource information text corpus T; (2) performing semantic-based word segmentation on the corpus T; (3) identifying enterprise research and development resource information entities based on semantic analysis, using a model that combines a Hidden Markov Model (HMM) with the Viterbi algorithm; (4) extracting the relationships between the identified entities based on semantic analysis, using the semi-supervised Snowball algorithm; (5) dynamically analyzing enterprise research and development resources, using keyword extraction to analyze how resources are used within the enterprise; (6) extracting and discovering the relationships between enterprise research and development resource information entities.

Description

Enterprise research and development resource information modeling method based on semantics
Technical Field
The invention belongs to the field of information modeling based on semantics and big data, relates to a construction method for unified modeling of enterprise research and development design resources, and particularly relates to a semantic-based enterprise research and development resource information modeling method.
Background
With the overall development of the economy and of data-related technologies, search activities have become part of every corner of economic life. The data search industry is booming at home and abroad and has become an important component of the socio-economic system. Abundant knowledge and intelligence are hidden behind huge and diverse data, but they are not discovered and used effectively in time, which seriously reduces the efficiency of data search activities. Existing research and development resources are mainly managed through separate, specialized expert information systems, which makes actual sharing of research and development resources among different departments of an enterprise difficult [1]. Because retrieving information by keywords is inefficient, a semantics-based method for constructing enterprise research and development resource information is proposed. In addition, traditional enterprise information modeling is limited to building information models for a specific research and development system, such as product-related information modeling [2][3], and lacks a resource management method with integrity and dynamics [4].
Research and development design resources lack a unified organization system and sharing mechanism among the enterprises and factories within an enterprise group. To address this problem, the patent targets the requirements that the whole product life cycle of design, manufacture, and service places on the sharing and integrated management of design resources. It studies a unified resource modeling method and coding system, the construction of a panoramic resource space model, and sharing modes for shareable, classified resources. On that basis it establishes a model, method, and system for the integrated management and sharing of enterprise research and development design resources, and develops an enterprise research and development design resource sharing platform with an application in a typical industry.
In summary, the following disadvantages and shortcomings exist in the prior art:
(1) the information content of enterprise research and development resources is disorderly, unified management and utilization are lacked, and the discovery and effective utilization rate is low;
(2) the flexibility and comprehensiveness of the enterprise research and development resource information in actual sharing among different departments in the enterprise are difficult to meet the requirement of each department on the full utilization of resources;
(3) the existing information modeling method is only limited to information model construction in a single field, and does not perform integrated management based on semantics and big data on information resources in the whole life cycle of enterprise research and development.
Based on the above, and with the aim of characterizing the high dynamics and uncertainty of enterprise design resource sharing and describing resource capability accurately, the patent studies a unified modeling method for enterprise research and development design resources based on semantics and big data. Using Python, machine learning frameworks, natural language processing, and similar tools, it addresses key technologies such as constructing the entity meta-model of enterprise research and development resource information, constructing the research and development design resource information model, and constructing a unified coding system.
References
[1] Salix populi, picnic, navy, guo, expert field ontology modeling and semantic information services research [ J ]. small microcomputer systems, 2012,33(08): 1730-.
[2] Von willebrand, zunzhou, patent name: a product associated information modeling method using design intention as guidance is disclosed, and the application number is as follows: CN201710229610.2.
[3] Von willebrand, gao yi clever, patent name: a sharing and calling method of a numerical control machine tool design resource cloud mode is disclosed, and the application number is as follows: CN201310238060.2.
[4] Rong xi, high construction people, patent name: a data-driven process industry complex electromechanical system information modeling method is disclosed in application number CN201710631783.7.
Disclosure of Invention
In order to solve the problems described in the background, the invention provides a semantic-based enterprise research and development resource information modeling method, which comprises the following steps:
(1) constructing an enterprise research and development resource information text corpus T;
(2) performing semantic-based text word segmentation on an enterprise research and development resource information text corpus T;
(3) identifying enterprise research and development resource information entities based on semantic analysis, using a model that combines a Hidden Markov Model (HMM) with the Viterbi algorithm, as follows:
the first step: training the model on the enterprise research and development resource information text corpus T, and processing the text to be input from the corpus T in combination with the state sequence generated during the semantic-analysis-based word segmentation of the enterprise research and development resource information text;
the second step: processing the text to be input from the corpus T in combination with that state sequence, and identifying the enterprise research and development resource information entities according to the solved state sequence;
(4) extracting the relationships between the identified enterprise research and development resource information entities based on semantic analysis, using the semi-supervised Snowball algorithm, as follows:
the first step: inputting a text to be processed and labeling the resource information entities identified in it;
the second step: defining the number of words taken before and after each resource information entity;
the third step: generating rules: according to the words taken before and after the resource information entities, the text to be processed is converted into the structure word vector + entity class + word vector, denoted as rule (L, T, M, T, R);
the fourth step: calculating rule similarity: for rule 1 $(L_1, T_1, M_1, T_1, R_1)$ and rule 2 $(L_2, T_2, M_2, T_2, R_2)$, if $T_1$ is not equal to $T_2$, rule 1 and rule 2 have no similarity; otherwise, the similarity between rule 1 and rule 2 is $W_1 L_1 L_2 + W_2 M_1 M_2 + W_3 R_1 R_2$, where $W_1$, $W_2$, and $W_3$ are the weights of the corresponding word vectors and the weight of the middle word vector is larger;
(5) dynamically analyzing enterprise research and development resources, using keyword extraction to analyze how resources are used within the enterprise, as follows:
the first step: establishing a stop word corpus and removing stop words from the segmented text, where the stop word corpus contains punctuation marks, common words, and words other than nouns, verbs, adjectives, and adverbs, so that only the useful words remain;
the second step: extracting keywords with the TF-IDF algorithm: the term frequency TF is the number of occurrences of an enterprise research and development resource information entity word in the corpus T divided by the total number of words in the corpus T, and the inverse document frequency is IDF = log(total number of documents in the corpus / (number of documents containing the entity word + 1)); the TF-IDF values of all words are calculated, the usage dynamics of the group's research and development resources are analyzed according to the extracted entity keywords, and the result serves as a reference for further extraction of entity relationships;
(6) extracting and discovering the relationships between enterprise research and development resource information entities: the entity objects are all entities related to enterprise research and development resources; the relationships among the entities are extracted, and the relationship tuples of the corresponding entities are extracted from the corpus T.
Further, word graph scanning is implemented based on a prefix dictionary $D_f$, generating a directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in the text of the enterprise research and development resource information text corpus T, as follows:
traversing each position of the text from front to back; for position k, first forming a segment L that contains only the character at position k, and judging whether the segment L is in the prefix dictionary $D_f$:
1) if the segment L is in the prefix dictionary $D_f$:
a) if the word frequency P of the segment L, which starts at position k and ends at position i, is greater than 0, adding position i to the list whose key is k;
b) if the word frequency P of the segment L is equal to 0, the prefix exists in the prefix dictionary $D_f$ but the word does not exist in the statistical dictionary, and the loop continues;
2) if the segment L is not in the prefix dictionary $D_f$:
a) the segment L is beyond the range of the words in the statistical dictionary, and the loop terminates;
b) adding 1 to position i to form a new segment L;
3) repeating step 1) and step 2), continuing to judge whether the new segment L is in the prefix dictionary $D_f$, until the traversal of the input text is finished;
4) generating the directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in the input text.
The technical scheme provided by the invention has the beneficial effects that:
(1) By constructing the enterprise resource sharing model, the invention makes resources accessible and usable to every employee, so that resources within the enterprise are used fully and comprehensively, resource utilization efficiency improves, and resource synergy is achieved.
(2) By constructing the enterprise resource sharing model, the invention realizes enterprise resource sharing, which encourages employees to update and innovate resources, helps the enterprise save training and technical transformation costs, and reduces research and development expenses.
(3) The enterprise resource sharing model constructed by the invention helps accelerate new product development and product transformation, increases competitiveness, and strengthens the cohesion of the enterprise.
Drawings
FIG. 1 is a flow chart of an enterprise research and development resource information modeling method based on semantics
FIG. 2 is a semantic-based resource unifying model anticipation function
FIG. 3 shows the segmentation result of T text in the corpus of research and development resources of enterprise
FIG. 4 shows the result of word segmentation of the read text
FIG. 5 shows the results of the enterprise research and development resource information entity
Fig. 6 snowball algorithm principle
FIG. 7 entity relationship extraction results
FIG. 8 is a flowchart of a process for using the Viterbi algorithm to find the maximum probability logarithm and the optimal path
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below
1. Construction of enterprise research and development resource information text corpus T
The enterprise resources in the patent are presented in a text form, and the content mainly refers to the information text of enterprise research and development resources related to the enterprise in the research and development process, including but not limited to papers, experimental reports, specifications, and the like. And uniformly coding the enterprise research and development resource information text to form an enterprise research and development resource information text corpus T for subsequent use.
2. Enterprise research and development resource information text word segmentation based on semantic analysis
Step 1: constructing a prefix dictionary
Parse the offline statistical dictionary text file, in which each line contains a word, its word frequency, and its part of speech. Extract the word and the word frequency and add them to the prefix dictionary $D_f$, with the word as the key and the word frequency as the value.
Load the enterprise research and development resource information text corpus T and obtain the prefixes of each word in the input text. If a prefix already exists in the prefix dictionary $D_f$, nothing is done; if it does not, it is added to $D_f$ with a word frequency of 0, which simplifies the subsequent construction of the directed acyclic graph.
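As a minimal sketch of Step 1, the following Python builds a prefix dictionary from lines of the form "word frequency part-of-speech". The function name `build_prefix_dict`, the three sample entries, and the choice to derive prefixes from the dictionary words themselves (the approach jieba takes) are assumptions made for illustration, not details stated in the patent.

```python
def build_prefix_dict(lines):
    """Build a prefix dictionary from 'word freq pos' lines.

    Full words map to their frequency; prefixes that are not
    words themselves are stored with frequency 0.
    """
    freq = {}
    total = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, count = parts[0], int(parts[1])
        freq[word] = count
        total += count
        # Register every proper prefix of the word with frequency 0
        # unless it is already present in the dictionary.
        for end in range(1, len(word)):
            prefix = word[:end]
            if prefix not in freq:
                freq[prefix] = 0
    return freq, total


# Hypothetical dictionary entries (word, frequency, part of speech).
sample_lines = ["数控 50 n", "数控机床 30 n", "机床 40 n"]
prefix_dict, total_freq = build_prefix_dict(sample_lines)
print(prefix_dict, total_freq)
```

Storing prefixes with frequency 0 lets the later graph scan distinguish "a longer word may still follow" from "no dictionary word starts with this segment".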
Step 2: constructing Directed Acyclic Graphs (DAG)
jieba uses the dict structure of Python, and the final directed acyclic graph (DAG) has the form {k: [k, j, ...], m: [m, p, q], ...}, where k and m are positions of characters in the input text of the enterprise research and development resource information text corpus T. The list for key k stores every end position j for which the segment [k: j+1], starting at k and ending at j, is a word in the prefix dictionary $D_f$. The specific operation is as follows:
traverse each position of the text from front to back; for position k, first form a segment L that contains only the character at position k, and judge whether the segment L is in the prefix dictionary $D_f$:
1) if the segment L is in the prefix dictionary $D_f$:
a) if the word frequency P of the segment L, which starts at position k and ends at position i, is greater than 0 (P > 0), add position i to the list whose key is k;
b) if the word frequency P of the segment L is equal to 0 (P = 0), the prefix exists in the prefix dictionary $D_f$ but the word does not exist in the statistical dictionary, and the loop continues;
2) if the segment L is not in the prefix dictionary $D_f$:
a) the segment L is beyond the range of the words in the statistical dictionary, and the loop terminates;
b) add 1 to position i to form a new segment L, whose index in the text is [k: i+1];
3) repeat step 1) and step 2), continuing to judge whether the new segment L is in the prefix dictionary $D_f$, until the traversal of the input text is finished.
4) Generate the directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in the input text.
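The scan described above can be sketched in Python as follows. This is an illustrative sketch modeled on the description, with a hypothetical function name `build_dag` and a toy prefix dictionary; positions whose single character is not a dictionary word fall back to a one-character entry, which is an assumption borrowed from jieba's behavior.

```python
def build_dag(sentence, prefix_dict):
    """Return {k: [end positions j]} where sentence[k:j+1] is a word."""
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        fragment = sentence[k]
        # Extend the fragment while it is still a known word or prefix.
        while i < n and fragment in prefix_dict:
            if prefix_dict[fragment] > 0:
                ends.append(i)  # P > 0: a real word ends at position i
            i += 1
            fragment = sentence[k:i + 1]
        if not ends:
            ends.append(k)  # fall back to the single character
        dag[k] = ends
    return dag


# Toy prefix dictionary (hypothetical frequencies; 0 marks a pure prefix).
toy_prefix_dict = {
    "数": 0, "数控": 50, "数控机": 0, "数控机床": 30,
    "机": 0, "机床": 40,
}
dag = build_dag("数控机床", toy_prefix_dict)
print(dag)
```

For the four-character input, position 0 can end a word at position 1 or 3, and position 2 can end a word at position 3, so the graph encodes both candidate segmentations.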
Step 3: searching for the maximum probability path by dynamic programming
Each node of the directed acyclic graph (DAG) constructed in step 2 has a weight w, which is the word frequency of the corresponding entry in the prefix dictionary $D_f$. A path through the DAG can be expressed as route = $(w_1, w_2, w_3, \ldots, w_n)$, and the goal is to maximize $\sum_i \mathrm{weight}(w_i)$. The specific method is as follows:
starting from the last character (N-1) of the input sentence from the enterprise research and development resource information text corpus T, traverse each character (idx) of the sentence in reverse order and calculate the probability logarithm score of the clause [idx, N-1].
For each idx, the case with the highest probability logarithm score is stored in $(w_1, w_2, w_3, \ldots, w_n)$ as a tuple (probability logarithm, end position of the word). Finally, the maximum probability path is obtained.
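A hedged sketch of this reverse-order dynamic program is given below. It assumes the weight of a word is the logarithm of its relative frequency (as jieba does), treats unseen single characters as having frequency 1, and uses a hand-written toy dictionary and graph; the names `calc_route` and `cut_by_route` are hypothetical.

```python
import math


def calc_route(sentence, dag, prefix_dict, total):
    """Fill route[idx] = (best log-probability of sentence[idx:], word end)."""
    n = len(sentence)
    log_total = math.log(total)
    route = {n: (0.0, 0)}  # empty suffix has log-probability 0
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (
                # Unknown or zero-frequency segments count as frequency 1.
                math.log(prefix_dict.get(sentence[idx:end + 1]) or 1)
                - log_total
                + route[end + 1][0],
                end,
            )
            for end in dag[idx]
        )
    return route


def cut_by_route(sentence, route):
    """Follow the stored end positions from left to right."""
    words, idx = [], 0
    while idx < len(sentence):
        end = route[idx][1] + 1
        words.append(sentence[idx:end])
        idx = end
    return words


# Toy inputs (hypothetical frequencies and the matching DAG).
toy_prefix_dict = {
    "数": 0, "数控": 50, "数控机": 0, "数控机床": 30,
    "机": 0, "机床": 40,
}
toy_dag = {0: [1, 3], 1: [1], 2: [3], 3: [3]}
route = calc_route("数控机床", toy_dag, toy_prefix_dict, total=120)
words = cut_by_route("数控机床", route)
print(words)
```

With these toy numbers the single four-character word scores log(30/120), which beats the two-word split scoring log(50/120) + log(40/120), so the path keeps the longer word.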
Step 4: calculating and recognizing unknown words
Unknown words are calculated and recognized by combining the Viterbi algorithm with a Hidden Markov Model (HMM). The specific method is as follows:
1) First, the enterprise research and development resource information text corpus T is used to train the HMM; frequencies are counted to obtain the initial state probabilities, the state transition probabilities, and the emission probabilities of the HMM (the probability calculation is well documented elsewhere and is not repeated here).
2) The Viterbi algorithm and the known initial state, state transition, and emission probabilities are then used to solve for the maximum probability logarithm and the optimal path, converting the text to be segmented into a state sequence of BMES type. The flow of the Viterbi algorithm is shown in FIG. 8.
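To make the decoding step concrete, here is a small Viterbi sketch over the four BMES states in log space. All probabilities below are invented toy values rather than parameters trained on the corpus T, and the names `viterbi`, `states_to_words`, `START_P`, `TRANS_P`, and `EMIT_P` are hypothetical.

```python
import math

STATES = "BMES"
MIN_LOG = -1e9  # stand-in for log(0)


def viterbi(obs, start_p, trans_p, emit_p):
    """Return (best log-probability, best BMES path) for the characters in obs."""
    scores = [{s: start_p.get(s, MIN_LOG) + emit_p[s].get(obs[0], MIN_LOG)
               for s in STATES}]
    back = [{}]
    for t in range(1, len(obs)):
        scores.append({})
        back.append({})
        for s in STATES:
            best, prev = max(
                (scores[t - 1][p] + trans_p[p].get(s, MIN_LOG), p)
                for p in STATES
            )
            scores[t][s] = best + emit_p[s].get(obs[t], MIN_LOG)
            back[t][s] = prev
    # A sentence can only end in state E (word end) or S (single).
    best, last = max((scores[-1][s], s) for s in "ES")
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best, path


def states_to_words(text, path):
    """Cut text into words according to its BMES state sequence."""
    words, start = [], 0
    for i, state in enumerate(path):
        if state == "B":
            start = i
        elif state == "E":
            words.append(text[start:i + 1])
        elif state == "S":
            words.append(text[i])
    return words


log = math.log
START_P = {"B": log(0.6), "S": log(0.4)}
TRANS_P = {
    "B": {"M": log(0.3), "E": log(0.7)},
    "M": {"M": log(0.3), "E": log(0.7)},
    "E": {"B": log(0.6), "S": log(0.4)},
    "S": {"B": log(0.6), "S": log(0.4)},
}
EMIT_P = {
    "B": {"机": log(0.8), "床": log(0.2)},
    "M": {"机": log(0.5), "床": log(0.5)},
    "E": {"机": log(0.1), "床": log(0.9)},
    "S": {"机": log(0.5), "床": log(0.5)},
}

score, path = viterbi("机床", START_P, TRANS_P, EMIT_P)
print(path, states_to_words("机床", path))
```

Working in log space turns the product of probabilities into a sum, which avoids numerical underflow on long sentences.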
Step 5: outputting the analysis result
The enterprise internal resource information text is segmented with the models trained in the steps above:
the input text from the enterprise research and development resource information text corpus T is "The CK6132 numerical control machine tool belongs to Beijing mechanical industry automation research institute limited company, and the available users are third-level mechanical engineers."
The word segmentation result is "CK6132 / numerical control machine tool / belonging to / Beijing / machinery / industry / automation / research institute / limited company / , / available / personnel / is / third-level / mechanical / engineer / ." FIG. 3 and FIG. 4 show the segmentation results for the input text of the corpus T and for the read text, respectively.
3. Enterprise research and development resource information entity identification based on semantic analysis
The enterprise research and development resource information entities are identified with a model that combines a Hidden Markov Model (HMM) with the Viterbi algorithm. The specific process is as follows:
Step 1: train the model on the enterprise research and development resource information text corpus T, and process the text to be input from the corpus T in combination with the state sequence generated during the semantic-analysis-based word segmentation of the enterprise research and development resource information text.
Step 2: the input text from the corpus T is "Wang Xiaolin is an employee of the design department of Beijing mechanical industry automation research institute limited company, and is mainly responsible for performing mechanical analysis of the engine interior using ANSYS."
Step 3: the output recognition result tags every character of the input: "Wang_B-PER | Xiao_I-PER | Lin_I-PER | is_O | Bei_B-ORG | jing_I-ORG", with the remaining characters of the organization name through "design department" tagged I-ORG and the characters outside any recognized entity tagged O.
Step 4: the labeled and identified enterprise research and development resource information entities are Wang Xiaolin (person name), Beijing mechanical industry automation research institute limited company design department (organization name), and ANSYS. FIG. 5 shows the entity recognition results.
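Turning a character-level tag sequence into entity strings can be sketched as below. The function name `bio_to_entities` is hypothetical, and the Chinese characters in the example are an assumed back-translation of a shortened version of the sample sentence, used only to show the mechanics.

```python
def bio_to_entities(chars, tags):
    """Group characters tagged B-X / I-X into (entity text, entity type) pairs."""
    entities = []
    current, label = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), label))
            current, label = [ch], tag[2:]  # start a new entity
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(ch)  # continue the open entity
        else:
            if current:
                entities.append(("".join(current), label))
            current, label = [], None  # tag O or an inconsistent I- tag
    if current:
        entities.append(("".join(current), label))
    return entities


# Assumed example: person name, then "is", then a two-character organization.
chars = list("王小林是北京员工")
tags = ["B-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O"]
entities = bio_to_entities(chars, tags)
print(entities)
```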
4. Enterprise research and development resource information entity relation extraction based on semantic analysis
The semi-supervised Snowball algorithm is used; its basic principle is shown in FIG. 6. The specific steps are as follows:
Step 1: input a text to be processed and label the enterprise research and development resource information entities in it. The text is "This morning Wang Xiaolin operated the CK6132 numerical control machine tool to process parts." The identified resource information entities are entity 1 "Wang Xiaolin" and entity 2 "CK6132 numerical control machine tool".
Step 2: define the number of words taken before and after each resource information entity. With a length of 2, the words taken before entity 1 "Wang Xiaolin" are [(today), (morning)] and the words taken after it are [(operate), entity 2]; similarly, the words taken before entity 2 "CK6132 numerical control machine tool" are [entity 1, (operate)] and the words taken after it are [(process), (part)].
Step 3: generate rules. According to the words taken before and after the resource information entities, the text to be processed is arranged into the structure (Left + entity + Middle + entity + Right), which is converted into (word vector + entity class + word vector) and denoted as rule (L, T, M, T, R).
Step 4: calculate rule similarity. For rule 1 $(L_1, T_1, M_1, T_1, R_1)$ and rule 2 $(L_2, T_2, M_2, T_2, R_2)$, if $T_1$ is not equal to $T_2$, rule 1 and rule 2 have no similarity; otherwise, the similarity of rule 1 and rule 2 is $S = W_1 L_1 L_2 + W_2 M_1 M_2 + W_3 R_1 R_2$, where $W_1$, $W_2$, and $W_3$ are the weights of the corresponding word vectors, and the weight of the middle word vector is generally larger.
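The similarity in Step 4 can be illustrated with a short Python sketch. It assumes each rule is a 5-tuple (left vector, first entity class, middle vector, second entity class, right vector), that the products such as $L_1 L_2$ are dot products, and that the weights are 0.2, 0.6, and 0.2; the function names, the entity classes "PER" and "EQUIP", and the two-dimensional vectors are all hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rule_similarity(rule1, rule2, weights=(0.2, 0.6, 0.2)):
    """Weighted context similarity; 0.0 when the entity classes differ."""
    left1, tag1a, mid1, tag1b, right1 = rule1
    left2, tag2a, mid2, tag2b, right2 = rule2
    if (tag1a, tag1b) != (tag2a, tag2b):
        return 0.0  # rules over different entity classes are not comparable
    w1, w2, w3 = weights
    return w1 * dot(left1, left2) + w2 * dot(mid1, mid2) + w3 * dot(right1, right2)


# Toy rules with made-up 2-dimensional context vectors.
rule_a = ([1.0, 0.0], "PER", [0.0, 1.0], "EQUIP", [1.0, 0.0])
rule_b = ([1.0, 0.0], "PER", [0.0, 1.0], "EQUIP", [0.0, 1.0])
rule_c = ([1.0, 0.0], "ORG", [0.0, 1.0], "EQUIP", [1.0, 0.0])

sim_ab = rule_similarity(rule_a, rule_b)
sim_ac = rule_similarity(rule_a, rule_c)
print(sim_ab, sim_ac)
```

Giving the middle context the largest weight reflects the observation that the words between two entities usually carry most of the relationship signal.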
5. Enterprise research and development resource usage dynamics
First, establish a stop word corpus.
Stop words are removed from the segmented text. The stop word corpus contains punctuation marks, common words, and words other than nouns, verbs, adjectives, and adverbs, so that only the practically useful words remain.
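A minimal sketch of this filtering step is shown below. The stop word set, the part-of-speech codes (n, v, a, d for noun, verb, adjective, adverb, in the style of jieba's tagger), and the function name `remove_stop_words` are illustrative assumptions.

```python
# Hypothetical stop word corpus: punctuation and very common words.
STOP_WORDS = {"，", "。", "的", "是"}
# Keep only nouns, verbs, adjectives, and adverbs (assumed POS code prefixes).
KEEP_POS = {"n", "v", "a", "d"}


def remove_stop_words(tagged_words):
    """Filter (word, pos) pairs down to the useful words."""
    return [
        word
        for word, pos in tagged_words
        if word not in STOP_WORDS and pos[:1] in KEEP_POS
    ]


sample = [("数控机床", "n"), ("的", "uj"), ("操作", "v"), ("。", "x")]
useful_words = remove_stop_words(sample)
print(useful_words)
```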
Second, extract keywords automatically with the TF-IDF algorithm, and use them to judge and analyze the usage dynamics of enterprise research and development resources.
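The corresponding TF-IDF algorithm is formulated as:
$$\mathrm{TF} = \frac{\text{number of occurrences of the word in the corpus } T}{\text{total number of words in the corpus } T}$$
$$\mathrm{IDF} = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}\right)$$
$$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$$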
The TF-IDF value of a word is proportional to the frequency of its occurrence in the document and inversely proportional to the frequency of its occurrence in the entire corpus, with a greater TF-IDF value indicating a greater importance of the word to the current document and vice versa. Therefore, the automatic keyword extraction is to calculate the TF-IDF values of all words in the document, and then arrange the words in descending order to take the first few words.
In this patent, the term frequency (TF) is the number of occurrences of an enterprise research and development resource information entity word in the corpus T divided by the total number of words in the corpus T, and the inverse document frequency (IDF) is log(total number of documents in the corpus / (number of documents containing the entity word + 1)). The usage dynamics of the group's existing research and development resources are then analyzed according to the extracted entity keywords, which also serve as a reference for further extraction of enterprise research and development resource information entity relationships.
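The definitions above can be worked through in a few lines of Python. This sketch assumes the corpus is a list of already segmented documents and uses the natural logarithm; the function names and the four toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score each word with corpus-level TF times log(N / (df + 1))."""
    all_words = [word for doc in documents for word in doc]
    counts = Counter(all_words)
    total_words = len(all_words)
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {
        word: (counts[word] / total_words)
        * math.log(n_docs / (doc_freq[word] + 1))
        for word in counts
    }


def top_keywords(scores, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


toy_docs = [
    ["machine", "part"],
    ["machine", "software"],
    ["machine", "software", "software"],
    ["machine"],
]
scores = tf_idf(toy_docs)
keywords = top_keywords(scores, 2)
print(keywords)
```

Note that with the "+ 1" in the denominator, a word that occurs in every document gets a slightly negative IDF, so it is pushed to the bottom of the ranking.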
6. Enterprise research and development resource information entity relationship extraction
The input text from the enterprise research and development resource information text corpus T is "The maximum rotation diameter over the bed of the CK6132 numerical control machine tool is 390 mm; it belongs to Beijing mechanical industry automation research institute limited company, and the available users are third-level mechanical engineers. SolidWorks is three-dimensional modeling software that supports multiple operating systems and is stored on the D drive."
Extraction results: as shown in FIG. 7, the model extracts the relationship between "the maximum rotation diameter of the numerical control machine tool" and "390 mm". For the SolidWorks enterprise research and development resource information entity, the model extracts two triples, which give the function of the entity and the operating systems it supports.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (2)

1. A semantic-based enterprise research and development resource information modeling method comprises the following steps:
(1) constructing an enterprise research and development resource information text corpus T;
(2) performing semantic-based text word segmentation on an enterprise research and development resource information text corpus T;
(3) identifying enterprise research and development resource information entities based on semantic analysis, using a model that combines a hidden Markov model (HMM) with the Viterbi algorithm, comprising:
the first step: training the model on the enterprise research and development resource information text corpus T, and processing the input text T of the enterprise research and development resource information text corpus in combination with the state sequence results generated during the semantic-analysis-based word segmentation of the enterprise research and development resource information text;
the second step: processing the input text T of the enterprise research and development resource information text corpus in combination with the state sequence results generated during the semantic-analysis-based word segmentation of the enterprise research and development resource information text, and identifying the enterprise research and development resource information entities according to the solved state sequence;
(4) extracting relationships between the identified enterprise research and development resource information entities based on semantic analysis, using the semi-supervised Snowball algorithm to extract entity relationships related to enterprise research and development information resources, comprising:
the first step: inputting a text to be processed, and labeling in it the resource information entities identified during enterprise research and development resource information entity identification;
the second step: defining the number of words taken before and after each resource information entity;
the third step: generating rules: according to the words taken before and after the resource information entities, converting the text to be processed into the structure word vector + entity class + word vector, denoted as rule $(L, T, M, T, R)$;
the fourth step: calculating rule similarity: for rule 1 $(L_1, T_1, M_1, T_1, R_1)$ and rule 2 $(L_2, T_2, M_2, T_2, R_2)$, if $T_1 \neq T_2$, rule 1 and rule 2 have no similarity; otherwise, the similarity of rule 1 and rule 2 is $S = W_1 (L_1 \cdot L_2) + W_2 (M_1 \cdot M_2) + W_3 (R_1 \cdot R_2)$, where $W_1$, $W_2$ and $W_3$ are the weights of the corresponding word vectors, and the weight of the middle word vector is larger;
(5) dynamically analyzing enterprise research and development resources, using keyword extraction to analyze how resources are used within the enterprise, comprising:
the first step: establishing a stop-word corpus and removing stop words from the segmented text, wherein the stop-word corpus includes punctuation marks, common words, and words other than nouns, verbs, adjectives and adverbs, so as to obtain the practically useful words;
the second step: extracting keywords with the TF-IDF algorithm: setting the term frequency TF as the number of occurrences of an enterprise research and development resource information entity word in the enterprise research and development resource information text corpus T divided by the total number of word occurrences in T, setting the inverse document frequency IDF as log(total number of documents in the enterprise resource information text corpus / (number of documents containing the entity word + 1)), and calculating the TF-IDF values of all words, so that the use dynamics of the group's research and development resources are analyzed according to the extracted entity keywords, which serve as a reference for the further extraction of enterprise research and development resource information entity relationships;
(6) extracting and discovering relationships between enterprise research and development resource information entities: the entity objects are all entities related to enterprise research and development resources; the relationships among these entities are extracted, and the relationship tuples of the corresponding enterprise resource information entities are extracted from the enterprise research and development resource information text corpus T.
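The rule-similarity calculation in the fourth step of step (4) can be sketched in Python as follows. This is only an illustration under stated assumptions: the word vectors are modeled as sparse token-to-weight dictionaries, each rule carries a single entity class T, and the default weights 0.2 / 0.6 / 0.2 are hypothetical values chosen only so that the middle vector weighs more. The names `Rule`, `dot` and `rule_similarity` are not from the claim.

```python
from dataclasses import dataclass
from typing import Dict

Vector = Dict[str, float]  # sparse word vector: token -> weight (assumption)

@dataclass
class Rule:
    left: Vector        # L: words taken before the entity
    entity_class: str   # T: class of the labeled entity
    middle: Vector      # M: words in the middle context
    right: Vector       # R: words taken after the entity

def dot(a: Vector, b: Vector) -> float:
    """Inner product of two sparse word vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

def rule_similarity(r1: Rule, r2: Rule,
                    w1: float = 0.2, w2: float = 0.6, w3: float = 0.2) -> float:
    """S = W1*(L1.L2) + W2*(M1.M2) + W3*(R1.R2); 0 if the entity classes differ."""
    if r1.entity_class != r2.entity_class:
        return 0.0  # rules with different entity classes have no similarity
    return (w1 * dot(r1.left, r2.left)
            + w2 * dot(r1.middle, r2.middle)
            + w3 * dot(r1.right, r2.right))

# Hypothetical rules built from contexts around labeled entities.
rule_a = Rule({"maximum": 1.0}, "EQUIPMENT", {"is": 1.0}, {"mm": 1.0})
rule_b = Rule({"maximum": 1.0}, "EQUIPMENT", {"is": 1.0}, {"cm": 1.0})
rule_c = Rule({"maximum": 1.0}, "SOFTWARE", {"is": 1.0}, {"mm": 1.0})
```

Here `rule_a` and `rule_b` share the left and middle contexts but not the right one, so only the first two terms contribute; `rule_a` and `rule_c` differ in entity class and therefore score zero.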
2. The method of claim 1, wherein word-graph scanning is performed based on a prefix dictionary $D_f$ to generate a directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in the input text T of the enterprise research and development resource information text corpus, the DAG being generated as follows:
sequentially traversing each position of the input text T of the enterprise research and development resource information text corpus from front to back; for a position k, first forming a segment L that contains only the character at position k, and judging whether the segment L is in the prefix dictionary $D_f$:
1) if the segment L is in the prefix dictionary $D_f$:
a) if the word frequency P of the segment L, which starts at position k and ends at a position i, is greater than 0, adding the position i to the list keyed by k;
b) if the word frequency P of the segment L, which starts at position k and ends at a position i, is equal to 0, this indicates that the prefix exists in the prefix dictionary $D_f$ but the word does not exist in the statistical dictionary, and the loop continues;
2) if the segment L is not in the prefix dictionary $D_f$:
a) this indicates that the segment L is beyond the range of the words in the statistical dictionary, and the loop terminates;
b) adding 1 to the position i to form a new segment L;
3) repeating step 1) and step 2), continuing to judge whether the new segment L is in the prefix dictionary $D_f$, until the traversal of the input text T of the enterprise research and development resource information text corpus is finished;
4) generating the directed acyclic graph DAG formed by all possible word formations of the Chinese characters in the input text T of the enterprise research and development resource information text corpus.
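The DAG construction of claim 2 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the helper `build_prefix_dict`, the sample word frequencies, and the fallback that lets a character with no dictionary match stand as a single-character word are all assumptions added to make the example self-contained.

```python
def build_prefix_dict(word_freqs):
    """Map every dictionary word to its frequency and every proper prefix
    that is not itself a word to frequency 0."""
    prefix = {}
    for word, freq in word_freqs.items():
        prefix[word] = freq
        for end in range(1, len(word)):
            prefix.setdefault(word[:end], 0)  # never overwrites a real word
    return prefix

def build_dag(text, prefix_dict):
    """Return {k: [i, ...]}, where text[k:i + 1] is a dictionary word."""
    dag = {}
    n = len(text)
    for k in range(n):
        ends = []
        i = k
        fragment = text[k]
        # Extend the fragment while it is still a known prefix.
        while i < n and fragment in prefix_dict:
            if prefix_dict[fragment] > 0:  # a real word, not just a prefix
                ends.append(i)
            i += 1
            fragment = text[k:i + 1]
        if not ends:
            ends.append(k)  # assumption: unmatched character forms its own word
        dag[k] = ends
    return dag

# Hypothetical statistical dictionary: word -> frequency.
prefix_dict = build_prefix_dict({"数控": 10, "机床": 8, "数控机床": 5})
dag = build_dag("数控机床", prefix_dict)
```

For the four-character text above, position 0 can end a word at positions 1 and 3, which corresponds to the two candidate segmentations "数控" + "机床" and "数控机床".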
CN202110318900.0A 2021-03-25 2021-03-25 Enterprise research and development resource information modeling method based on semantics Active CN113065343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110318900.0A CN113065343B (en) 2021-03-25 2021-03-25 Enterprise research and development resource information modeling method based on semantics

Publications (2)

Publication Number Publication Date
CN113065343A CN113065343A (en) 2021-07-02
CN113065343B true CN113065343B (en) 2022-06-10

Family

ID=76561853

Country Status (1)

Country Link
CN (1) CN113065343B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814486A (en) * 2020-07-10 2020-10-23 东软集团(上海)有限公司 Enterprise client tag generation method, system and device based on semantic analysis
CN112307153A (en) * 2020-09-30 2021-02-02 杭州量知数据科技有限公司 Automatic construction method and device of industrial knowledge base and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372058B (en) * 2016-08-29 2019-10-15 中译语通科技股份有限公司 A kind of short text Emotional Factors abstracting method and device based on deep learning
CN107193959B (en) * 2017-05-24 2020-11-27 南京大学 Pure text-oriented enterprise entity classification method
US10394958B2 (en) * 2017-11-09 2019-08-27 Conduent Business Services, Llc Performing semantic analyses of user-generated text content using a lexicon
CN108415953B (en) * 2018-02-05 2021-08-13 华融融通(北京)科技有限公司 Method for managing bad asset management knowledge based on natural language processing technology
CN108460014B (en) * 2018-02-07 2022-02-25 百度在线网络技术(北京)有限公司 Enterprise entity identification method and device, computer equipment and storage medium
CN109101477B (en) * 2018-06-04 2023-01-31 东南大学 Enterprise field classification and enterprise keyword screening method
CN110020424B (en) * 2019-01-04 2023-10-31 创新先进技术有限公司 Contract information extraction method and device and text information extraction method
CN111008530A (en) * 2019-12-03 2020-04-14 中国石油大学(华东) Complex semantic recognition method based on document word segmentation
CN111767716A (en) * 2020-06-24 2020-10-13 中国平安财产保险股份有限公司 Method and device for determining enterprise multilevel industry information and computer equipment
CN112036178A (en) * 2020-08-25 2020-12-04 国家电网有限公司 Distribution network entity related semantic search method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant