CN113065343B - Enterprise research and development resource information modeling method based on semantics

Info

Publication number: CN113065343B
Application number: CN202110318900.0A
Authority: CN
Inventors: 王磊; 马剑
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2022-06-10
Anticipated expiration: 2041-03-25
Also published as: CN113065343A

Abstract

The invention relates to a semantic-based enterprise research and development resource information modeling method, which comprises the following steps: (1) constructing an enterprise research and development resource information text corpus T; (2) performing semantic-based word segmentation on the texts of the corpus T; (3) identifying enterprise research and development resource information entities based on semantic analysis, using a model that combines a Hidden Markov Model (HMM) with the Viterbi algorithm; (4) extracting the relationships between enterprise research and development resource information entities based on semantic analysis, using the semi-supervised Snowball algorithm; (5) performing dynamic analysis of enterprise research and development resources, using keyword extraction to analyze how resources are used within the enterprise; and (6) extracting and discovering the relationships between the enterprise research and development resource information entities.

Description

Enterprise research and development resource information modeling method based on semantics

Technical Field

The invention belongs to the field of information modeling based on semantics and big data, relates to a construction method for unified modeling of enterprise research and development design resources, and particularly relates to a semantic-based enterprise research and development resource information modeling method.

Background

With the overall development of the social economy and of data-related technologies, search activities have been integrated into every corner of the social economy. The data search industry is booming at home and abroad and has become an important component of the social and economic system. Abundant knowledge and intelligence are hidden behind huge and diversified data, but they are not discovered and effectively utilized in time, which seriously reduces the efficiency of data search activities. Existing research and development resources are mainly managed through separate special-purpose expert information systems, so the actual sharing of research and development resources among different departments of an enterprise is difficult [1]. In view of the low efficiency of keyword-based information retrieval, a semantic-based method for constructing enterprise research and development resource information is provided. In addition, traditional enterprise information modeling is limited to building the information model of a specific research and development system, such as product-related information modeling [2][3], and lacks an integrated and dynamic resource management method [4].

Aiming at the problem that research and development design resources lack a unified organization system and sharing mechanism among the enterprises and factories belonging to a group, this patent addresses the requirements that the whole life cycle of product design, manufacture and service places on the sharing and integrated management of design resources. By establishing a unified resource modeling method and coding system, constructing a resource panoramic space model, and studying sharing modes for shareable classified resources, it establishes a model, method and system for the integrated management and sharing of enterprise research and development design resources, and develops an enterprise research and development design resource sharing platform for application in a typical industry.

In summary, the following disadvantages and shortcomings exist in the prior art:

(1) the information content of enterprise research and development resources is disorderly, unified management and utilization are lacked, and the discovery and effective utilization rate is low;

(2) the flexibility and comprehensiveness of the enterprise research and development resource information in actual sharing among different departments in the enterprise are difficult to meet the requirement of each department on the full utilization of resources;

(3) the existing information modeling method is only limited to information model construction in a single field, and does not perform integrated management based on semantics and big data on information resources in the whole life cycle of enterprise research and development.

On this basis, in view of the high dynamics and uncertainty of enterprise design resource sharing and the need to define resource capabilities accurately, this patent studies a unified modeling method for enterprise research and development design resources based on semantics and big data. Using Python, machine learning frameworks, natural language processing and related techniques, it breaks through key technologies such as the construction of the enterprise research and development resource information entity meta-model, the construction of the research and development design resource information model, and the construction of a unified coding system.

References

[1] Salix populi, picnic, navy, guo, expert field ontology modeling and semantic information services research [ J ]. small microcomputer systems, 2012,33(08): 1730-.

[2] Von willebrand, zunzhou, patent name: a product associated information modeling method using design intention as guidance is disclosed, and the application number is as follows: CN201710229610.2.

[3] Von willebrand, gao yi clever, patent name: a sharing and calling method of a numerical control machine tool design resource cloud mode is disclosed, and the application number is as follows: CN201310238060.2.

[4] Rong xi, high construction people, patent name: a data-driven process industry complex electromechanical system information modeling method is disclosed in application number CN201710631783.7.

Disclosure of Invention

In order to solve the problems in the background art, the invention provides a semantic-based enterprise research and development resource information modeling method, which comprises the following steps:

(1) constructing an enterprise research and development resource information text corpus T;

(2) performing semantic-based text word segmentation on an enterprise research and development resource information text corpus T;

(3) identifying enterprise research and development resource information entities based on semantic analysis, using a model that combines a Hidden Markov Model (HMM) with the Viterbi algorithm, as follows:

the first step: training the model on the enterprise research and development resource information text corpus T, and processing the text to be input from the corpus T in combination with the state sequence result generated during the semantic-analysis-based word segmentation of the enterprise research and development resource information text;

the second step: processing the text to be input from the corpus T in combination with the state sequence result generated during the semantic-analysis-based word segmentation, and identifying the enterprise research and development resource information entities according to the solved state sequence;

(4) extracting the relationships between enterprise research and development resource information entities based on semantic analysis, using the semi-supervised Snowball algorithm, as follows:

the first step: inputting a text to be processed, and labeling the identified enterprise research and development resource information entities in the text to be processed;

the second step: defining the number of words taken before and after each resource information entity;

the third step: generating a rule: according to the words taken before and after the resource information entities, the text to be processed is organized into a structure that is converted into: word vector + entity class + word vector, denoted as rule (L, T, M, T, R);

the fourth step: calculating rule similarity: for rule 1 $(L_1, T_1, M_1, T_1, R_1)$ and rule 2 $(L_2, T_2, M_2, T_2, R_2)$, if $T_1 \neq T_2$, rule 1 and rule 2 have no similarity; otherwise, the similarity between rule 1 and rule 2 is $S = W_1 (L_1 \cdot L_2) + W_2 (M_1 \cdot M_2) + W_3 (R_1 \cdot R_2)$, wherein $W_1$, $W_2$ and $W_3$ are the weights of the corresponding word vectors, and the weight of the middle word vector is larger;

(5) performing dynamic analysis of enterprise research and development resources, using keyword extraction to analyze how resources are used within the enterprise, as follows:

the first step: establishing a stop word corpus, and removing stop words from the segmented text, wherein the stop word corpus contains punctuation marks, common words, and words other than nouns, verbs, adjectives and adverbs, so as to obtain the words that are actually useful;

the second step: extracting keywords with the TF-IDF algorithm: the word frequency is set as $\mathrm{TF} = \dfrac{\text{number of occurrences of the entity word in corpus } T}{\text{total number of word occurrences in corpus } T}$ and the inverse document frequency as $\mathrm{IDF} = \log \dfrac{\text{total number of documents in corpus } T}{\text{number of documents containing the entity word} + 1}$; the TF-IDF values of all words are calculated, the use dynamics of group research and development resources are analyzed according to the extracted enterprise research and development resource information entity keywords, and the result serves as a reference for the further extraction of enterprise research and development resource information entity relationships;

(6) extracting and discovering the relationships between enterprise research and development resource information entities: the entity objects are all entities related to enterprise research and development resources; the relationships among the entities are extracted, and the relationship tuples of the corresponding enterprise resource information entities are extracted from the enterprise research and development resource information text corpus T.

Further, word graph scanning is implemented based on the prefix dictionary D_f, generating a directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in the text of the enterprise research and development resource information text corpus T, as follows:

each position of the text of the corpus T is traversed in sequence from front to back; for a position k, a segment L containing only the character at position k is first formed, and it is judged whether the segment L is in the prefix dictionary D_f:

1) if the segment L is in the prefix dictionary D_f:

a) if the word frequency P of the segment L, which starts at position k and ends at a position i, is greater than 0, position i is added to the list keyed by k;

b) if the word frequency P of the segment L, which starts at position k and ends at a position i, is equal to 0, the prefix exists in the prefix dictionary D_f but the word does not exist in the statistical dictionary, and the loop continues;

2) if the segment L is not in the prefix dictionary D_f:

a) the segment L is beyond the range of the words in the statistical dictionary, and the loop terminates;

b) position i is increased by 1 to form a new segment L;

3) step 1) and step 2) are repeated, continuing to judge whether the new segment L is in the prefix dictionary D_f, until the traversal of the input text of the corpus T is finished;

4) a directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in the input text of the corpus T is generated.

The technical scheme provided by the invention has the beneficial effects that:

(1) By constructing an enterprise resource sharing model, the invention lets every employee access and use the resources, so that resources in the enterprise are utilized fully and comprehensively, resource utilization efficiency is improved, and a resource synergy effect is achieved.

(2) By constructing the enterprise resource sharing model, the invention realizes enterprise resource sharing, which helps encourage enterprise staff to update and innovate resources, helps the enterprise save training and technical transformation costs, and reduces research and development expenses.

(3) The enterprise resource sharing model constructed by the invention helps accelerate new product development and product transformation, increases competitiveness, and helps strengthen the cohesion of the enterprise.

Drawings

FIG. 1 is a flow chart of the semantic-based enterprise research and development resource information modeling method.

FIG. 2 shows the expected functions of the semantic-based unified resource model.

FIG. 3 shows the word segmentation result for input text of the enterprise research and development resource information text corpus T.

FIG. 4 shows the word segmentation result for the read-in text.

FIG. 5 shows the enterprise research and development resource information entity identification results.

FIG. 6 shows the principle of the Snowball algorithm.

FIG. 7 shows the entity relationship extraction results.

FIG. 8 is a flow chart of using the Viterbi algorithm to find the maximum log-probability and the optimal path.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.

1. Construction of enterprise research and development resource information text corpus T

The enterprise resources in this patent are presented in text form. The content mainly refers to the enterprise research and development resource information texts involved in the research and development process, including but not limited to papers, experimental reports and specifications. The enterprise research and development resource information texts are uniformly encoded to form the enterprise research and development resource information text corpus T for subsequent use.

2. Enterprise research and development resource information text word segmentation based on semantic analysis

Step 1: constructing a prefix dictionary

The offline statistical dictionary text file is parsed. Each line contains a word, its word frequency and its part of speech; the word and the word frequency are extracted and added to the prefix dictionary D_f, with the word as the key and the word frequency as the value.

The enterprise research and development resource information text corpus T is loaded, and the prefixes of each word of the input text are obtained. If a prefix already exists in the prefix dictionary D_f, no processing is done; if a prefix is not in the prefix dictionary D_f, it is added with a word frequency of 0, which facilitates the subsequent construction of the directed acyclic graph.
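The prefix dictionary construction in Step 1 can be sketched in Python as follows. This is a minimal illustration, assuming a dictionary file with one `word frequency part-of-speech` entry per line; the function name `build_prefix_dict` and the sample entries are hypothetical and do not come from the patent.

```python
def build_prefix_dict(lines):
    """Build prefix dictionary D_f: word -> frequency.

    Every proper prefix of every word is also registered; a prefix that is
    not itself a dictionary word gets frequency 0.
    """
    d_f = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, freq = parts[0], int(parts[1])
        d_f[word] = freq
        # Register each proper prefix so a later scan can keep extending.
        for end in range(1, len(word)):
            prefix = word[:end]
            if prefix not in d_f:
                d_f[prefix] = 0
    return d_f


# Hypothetical dictionary entries: word, frequency, part of speech.
sample = ["数控机床 120 n", "数控 300 n", "机床 500 n", "数 50 n"]
D_f = build_prefix_dict(sample)
```

Storing bare prefixes with frequency 0 lets the scan in Step 2 tell "not a word yet, but may become one" apart from "cannot become a word".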

Step 2: constructing Directed Acyclic Graphs (DAG)

jieba uses Python's dict structure, and the final directed acyclic graph (DAG) has the form {k: [k, j, ...], m: [m, p, q], ...}, where k and m are positions in the input text of the enterprise research and development resource information text corpus T, and the list corresponding to k stores the end positions j for which the segment [k: j+1] of the text, beginning at k and ending at j, is a word in the prefix dictionary D_f. The specific operation is as follows:

1) if the segment L is in the prefix dictionary D_f:

a) if the word frequency P of the segment L, which starts at position k and ends at a position i, is greater than 0 (P > 0), position i is added to the list keyed by k;

b) if the word frequency P of the segment L, which starts at position k and ends at a position i, is equal to 0 (P = 0), the prefix exists in the prefix dictionary D_f but the word does not exist in the statistical dictionary, and the loop continues;

2) if the segment L is not in the prefix dictionary D_f:

a) the segment L is beyond the range of the words in the statistical dictionary, and the loop terminates;

b) position i is increased by 1 to form a new segment L, whose index range in the text is [k: i+1];

3) step 1) and step 2) are repeated, continuing to judge whether the new segment L is in the prefix dictionary D_f, until the traversal of the input text of the corpus T is finished.

4) A directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in the input text of the corpus T is generated.
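The scan above can be sketched in Python as follows. This is a minimal illustration in the style of jieba's DAG construction, not the patent's own code; the function name `build_dag` and the small prefix dictionary are hypothetical. As an assumption of this sketch, a position with no dictionary word falls back to a single-character entry.

```python
def build_dag(text, d_f):
    """Return {k: [i, ...]} where text[k:i + 1] is a word with frequency > 0."""
    dag = {}
    n = len(text)
    for k in range(n):
        ends = []
        i = k
        segment = text[k]
        # Keep extending while the segment is still a known prefix.
        while i < n and segment in d_f:
            if d_f[segment] > 0:   # P > 0: a dictionary word ends at i
                ends.append(i)
            i += 1                 # extend the segment by one character
            segment = text[k:i + 1]
        if not ends:
            ends.append(k)         # assumption: fall back to one character
        dag[k] = ends
    return dag


# Hypothetical prefix dictionary (frequency 0 marks a bare prefix).
D_f = {"数": 50, "数控": 300, "数控机": 0, "数控机床": 120,
       "机": 0, "机床": 500}
dag = build_dag("数控机床", D_f)
# dag == {0: [0, 1, 3], 1: [1], 2: [3], 3: [3]}
```

In this example the entry `0: [0, 1, 3]` records that three dictionary words start at position 0 and end at positions 0, 1 and 3.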

Step 3: Dynamic programming to find the maximum-probability path

Each node of the directed acyclic graph (DAG) constructed in Step 2 has a weight w, which is the word frequency of the corresponding word in the prefix dictionary D_f. A path through the DAG can be expressed as $route = (w_1, w_2, w_3, \ldots, w_n)$, and the goal is to maximize $\sum_i weight(w_i)$. The specific method is as follows:

Starting from the last character (N-1) of the input sentence from the enterprise research and development resource information text corpus T, each character position (idx) of the sentence is traversed in reverse order, and the log-probability score of the sub-sentence [idx ~ N-1] is calculated.

For each position, the case with the highest log-probability score is stored in the route as a tuple (log-probability, end position of the word). Finally, the maximum-probability path is obtained.
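The reverse traversal can be sketched in Python as follows. This is a minimal illustration under stated assumptions: word probabilities are estimated as frequency divided by the sum of all frequencies in D_f, a segment missing from D_f is given frequency 1, and the names `best_route` and `cut`, the dictionary and the DAG are all hypothetical.

```python
from math import log


def best_route(text, dag, d_f):
    """route[idx] = (best log-probability of text[idx:], end of first word)."""
    n = len(text)
    log_total = log(sum(d_f.values()))  # assumed normaliser
    route = {n: (0.0, 0)}
    for idx in range(n - 1, -1, -1):    # reverse order, from N-1 down to 0
        route[idx] = max(
            (log(d_f.get(text[idx:x + 1]) or 1) - log_total + route[x + 1][0], x)
            for x in dag[idx]
        )
    return route


def cut(text, dag, d_f):
    """Follow the stored end positions to read off the best segmentation."""
    route = best_route(text, dag, d_f)
    words, idx = [], 0
    while idx < len(text):
        end = route[idx][1]
        words.append(text[idx:end + 1])
        idx = end + 1
    return words


D_f = {"数": 50, "数控": 300, "数控机": 0, "数控机床": 120,
       "机": 0, "机床": 500}
dag = {0: [0, 1, 3], 1: [1], 2: [3], 3: [3]}
words = cut("数控机床", dag, D_f)
# words == ["数控", "机床"]
```

With these hypothetical frequencies the two-word path scores higher than the single four-character word; different frequencies would change the outcome.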

Step 4: Calculating and identifying unknown words

Unknown words are calculated and recognized by combining the Viterbi algorithm with a Hidden Markov Model (HMM). The specific method is as follows:

1) First, the enterprise research and development resource information text corpus T is used to train the Hidden Markov Model (HMM); frequencies are counted to obtain the initial state probabilities, state transition probabilities and emission probabilities of the HMM (the calculation of these probabilities is well documented and is not repeated here).

2) Using the Viterbi algorithm with the known initial state probabilities, state transition probabilities and emission probabilities, the maximum log-probability and the optimal path are solved, and the text to be segmented is converted into a state sequence of BMES type. The Viterbi algorithm flow is shown in FIG. 8.
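A minimal Python sketch of Viterbi decoding over BMES states (B: beginning of a word, M: middle, E: end, S: single-character word) is given below. It is an illustration only: the probability tables are small hypothetical values rather than parameters trained on the corpus T, and the function names `viterbi` and `states_to_words` are not from the patent.

```python
from math import log

STATES = "BMES"
NEG_INF = float("-inf")


def viterbi(obs, start_p, trans_p, emit_p):
    """Return (best log-probability, best BMES state string) for obs."""
    # V[t][s]: best log-probability of any path ending in state s at time t.
    V = [{s: start_p[s] + emit_p[s].get(obs[0], NEG_INF) for s in STATES}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in STATES:
            prob, prev = max(
                (V[t - 1][p] + trans_p[p].get(s, NEG_INF), p) for p in STATES
            )
            V[t][s] = prob + emit_p[s].get(obs[t], NEG_INF)
            back[t][s] = prev
    # A complete segmentation must finish in E or S.
    prob, last = max((V[-1][s], s) for s in "ES")
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return prob, "".join(reversed(path))


def states_to_words(text, states):
    """Cut text after every E or S state."""
    words, start = [], 0
    for i, s in enumerate(states):
        if s in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    return words


# Hypothetical log-probability tables (not trained values).
start_p = {"B": log(0.6), "M": NEG_INF, "E": NEG_INF, "S": log(0.4)}
trans_p = {"B": {"M": log(0.3), "E": log(0.7)},
           "M": {"M": log(0.4), "E": log(0.6)},
           "E": {"B": log(0.5), "S": log(0.5)},
           "S": {"B": log(0.5), "S": log(0.5)}}
emit_p = {"B": {"机": log(0.8), "床": log(0.2)},
          "M": {"机": log(0.5), "床": log(0.5)},
          "E": {"机": log(0.1), "床": log(0.9)},
          "S": {"机": log(0.5), "床": log(0.5)}}

prob, states = viterbi("机床", start_p, trans_p, emit_p)
# states == "BE"
```

Working in log space turns products of probabilities into sums, which avoids numerical underflow on long texts.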

Step 5: Outputting the analysis result

The enterprise internal resource information text is segmented using the models trained in the steps above:

The input text from the enterprise research and development resource information text corpus T is "The CK6132 numerical control machine tool belongs to Beijing Mechanical Industry Automation Research Institute Co., Ltd., and the personnel who may use it are third-level mechanical engineers."

The word segmentation result is "CK6132/numerical control machine tool/belongs to/Beijing/mechanical/industry/automation/research institute/Co., Ltd./,/available/personnel/are/third-level/mechanical/engineers/." FIG. 3 and FIG. 4 show the word segmentation results for the input text of the corpus T and for the read-in text, respectively.

3. Enterprise research and development resource information entity identification based on semantic analysis

The enterprise research and development resource information entities are identified by a model that combines a Hidden Markov Model (HMM) with the Viterbi algorithm. The specific process is as follows:

Step 1: The model is trained on the enterprise research and development resource information text corpus T, and the text to be input from the corpus T is processed in combination with the state sequence result generated during the semantic-analysis-based word segmentation of the enterprise research and development resource information text.

Step 2: The input text from the corpus T is "Wangchun is an employee of the design department of Beijing Mechanical Industry Automation Research Institute Co., Ltd., and is mainly responsible for using ANSYS to perform mechanical analysis of engine interiors."

Step 4: The labeled and identified enterprise research and development resource information entities are Wangchun (person name), the design department of Beijing Mechanical Industry Automation Research Institute Co., Ltd. (organization name) and ANSYS. FIG. 5 shows the enterprise research and development resource information entity identification results.

4. Enterprise research and development resource information entity relation extraction based on semantic analysis

The semi-supervised Snowball algorithm is adopted; its basic principle is shown in FIG. 6, and the specific steps are as follows:

Step 1: Input the text to be processed and label the enterprise research and development resource information entities in it. The text is "This morning, Wangchun operated the CK6132 numerical control machine tool to machine parts." The identified resource information entities are entity 1 "Wangchun" and entity 2 "CK6132 numerical control machine tool".

Step 2: Define the context length: the number of words taken before and after each resource information entity is defined as 2. For entity 1 "Wangchun", the words taken before it are [(today), (morning)] and the words taken after it are [(operate), entity 2]; similarly, for entity 2 "CK6132 numerical control machine tool", the words taken before it are [entity 1, (operate)] and the words taken after it are [(machine), (part)].

Step 3: Generate a rule: according to the words taken before and after the resource information entities, the text to be processed is organized into the structure (Left + entity + Middle + entity + Right), which is then converted into (word vector + entity class + word vector), denoted as rule (L, T, M, T, R).
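Steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the contexts are kept as token lists rather than word vectors, the two T slots hold the classes of entity 1 and entity 2, all tokens between the two entities form the middle context, and the name `make_rule` and the class labels `PERSON` and `EQUIPMENT` are hypothetical.

```python
def make_rule(tokens, e1, e2, classes, window=2):
    """Build (L, T, M, T, R) for entities at token indices e1 < e2."""
    left = tokens[max(0, e1 - window):e1]      # words taken before entity 1
    middle = tokens[e1 + 1:e2]                 # words between the entities
    right = tokens[e2 + 1:e2 + 1 + window]     # words taken after entity 2
    return (left, classes[0], middle, classes[1], right)


# Tokens of the example sentence in Step 1 (English stand-ins).
tokens = ["today", "morning", "Wangchun", "operate",
          "CK6132 numerical control machine tool", "machine", "part"]
rule = make_rule(tokens, 2, 4, ("PERSON", "EQUIPMENT"))
# rule == (["today", "morning"], "PERSON", ["operate"],
#          "EQUIPMENT", ["machine", "part"])
```

In a fuller implementation each of the three token lists would then be mapped to a word vector before rules are compared.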

Step 4: Calculate rule similarity: for rule 1 $(L_1, T_1, M_1, T_1, R_1)$ and rule 2 $(L_2, T_2, M_2, T_2, R_2)$, if $T_1 \neq T_2$, rule 1 and rule 2 have no similarity; otherwise, the similarity between rule 1 and rule 2 is $S = W_1 (L_1 \cdot L_2) + W_2 (M_1 \cdot M_2) + W_3 (R_1 \cdot R_2)$, where $W_1$, $W_2$ and $W_3$ are the weights of the corresponding word vectors, and the weight of the middle word vector is generally larger.
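The similarity formula can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each context is represented by a unit-length bag-of-words vector (the patent does not specify the word vector construction), the weights (0.2, 0.6, 0.2) are hypothetical values chosen so that the middle weight is largest, and all function names are illustrative.

```python
from collections import Counter
from math import sqrt


def unit_vector(tokens):
    """Bag-of-words vector scaled to unit length (assumed representation)."""
    counts = Counter(tokens)
    norm = sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}


def dot(a, b):
    return sum(w * b.get(k, 0.0) for k, w in a.items())


def rule_similarity(r1, r2, weights=(0.2, 0.6, 0.2)):
    """S = W1 (L1 . L2) + W2 (M1 . M2) + W3 (R1 . R2), or 0 if classes differ."""
    l1, t1a, m1, t1b, rt1 = r1
    l2, t2a, m2, t2b, rt2 = r2
    if (t1a, t1b) != (t2a, t2b):
        return 0.0                      # different entity classes: no similarity
    w1, w2, w3 = weights
    return (w1 * dot(unit_vector(l1), unit_vector(l2))
            + w2 * dot(unit_vector(m1), unit_vector(m2))
            + w3 * dot(unit_vector(rt1), unit_vector(rt2)))


r1 = (["today", "morning"], "PERSON", ["operate"], "EQUIPMENT", ["machine", "part"])
r2 = (["yesterday", "morning"], "PERSON", ["operate"], "EQUIPMENT", ["machine", "part"])
r3 = (["today", "morning"], "PERSON", ["operate"], "SOFTWARE", ["machine", "part"])
s12 = rule_similarity(r1, r2)   # 0.2 * 0.5 + 0.6 * 1 + 0.2 * 1 = 0.9
s13 = rule_similarity(r1, r3)   # 0.0, because the entity classes differ
```

Giving the middle context the largest weight reflects the text's remark that the words between two entities say the most about their relationship.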

5. Enterprise research and development resource usage dynamics

First, a stop word corpus is established.

Stop words are removed from the segmented text. The stop word corpus contains punctuation marks, common words, and words other than nouns, verbs, adjectives and adverbs, so that only the practically useful words remain.

Second, keywords are automatically extracted with the TF-IDF algorithm, and the use dynamics of enterprise research and development resources are judged and analyzed.

The corresponding TF-IDF algorithm is formulated as:

$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$, where TF is the word frequency and IDF is the inverse document frequency.

The TF-IDF value of a word is proportional to the frequency of its occurrence in the document and inversely proportional to the frequency of its occurrence in the entire corpus, with a greater TF-IDF value indicating a greater importance of the word to the current document and vice versa. Therefore, the automatic keyword extraction is to calculate the TF-IDF values of all words in the document, and then arrange the words in descending order to take the first few words.

In this patent, the word frequency is $\mathrm{TF} = \dfrac{\text{number of occurrences of the entity word in corpus } T}{\text{total number of word occurrences in corpus } T}$, and the inverse document frequency is $\mathrm{IDF} = \log \dfrac{\text{total number of documents in corpus } T}{\text{number of documents containing the entity word} + 1}$. The use dynamics of the existing group research and development resources are then analyzed according to the extracted enterprise research and development resource information entity keywords, which also serve as a reference for the further extraction of enterprise research and development resource information entity relationships.
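The stop word removal and TF-IDF ranking can be sketched in Python as follows. This is a minimal illustration that follows the definitions above (TF counted over the whole corpus, IDF with +1 in the denominator); the function name `tfidf_keywords`, the tiny corpus and the stop word set are hypothetical.

```python
from collections import Counter
from math import log


def tfidf_keywords(docs, stop_words, top_k=3):
    """Rank words of a tokenised corpus by TF-IDF after stop word removal."""
    docs = [[w for w in doc if w not in stop_words] for doc in docs]
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())          # total word occurrences in corpus
    n_docs = len(docs)
    scores = {}
    for word, count in counts.items():
        df = sum(1 for doc in docs if word in doc)   # documents containing word
        scores[word] = (count / total) * log(n_docs / (df + 1))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical segmented documents.
docs = [["CK6132", "machine", "the"],
        ["SolidWorks", "software", "the"],
        ["ANSYS", "software", "the"],
        ["CK6132", "CK6132", "engineer"]]
keywords = tfidf_keywords(docs, stop_words={"the"})
```

Note that with the +1 in the denominator, a word that appears in every document receives a slightly negative IDF, so it sinks to the bottom of the ranking.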

6. Enterprise research and development resource information entity relationship extraction

The input text from the enterprise research and development resource information text corpus T is "The maximum swing diameter over the bed of the CK6132 numerical control machine tool is 390 mm; the machine tool belongs to Beijing Mechanical Industry Automation Research Institute Co., Ltd., and the personnel who may use it are third-level mechanical engineers. SolidWorks is three-dimensional modeling software, supports various operating systems and is stored on drive D."

Extraction results: as shown in FIG. 7, the model extracted the relationship between "the maximum swing diameter of the numerical control machine tool" and "390 mm". For the SolidWorks enterprise research and development resource information entity, the model extracted two triples, giving the function of the entity and the operating systems it supports.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art can, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A semantic-based enterprise research and development resource information modeling method comprises the following steps:

the fourth step: calculating rule similarity: for rule 1 $(L_1, T_1, M_1, T_1, R_1)$ and rule 2 $(L_2, T_2, M_2, T_2, R_2)$, if $T_1 \neq T_2$, rule 1 and rule 2 have no similarity; otherwise, the similarity between rule 1 and rule 2 is $S = W_1 (L_1 \cdot L_2) + W_2 (M_1 \cdot M_2) + W_3 (R_1 \cdot R_2)$, wherein $W_1$, $W_2$ and $W_3$ are the weights of the corresponding word vectors, and the weight of the middle word vector is larger;

(5) performing dynamic analysis of enterprise research and development resources, using keyword extraction to analyze how resources are used within the enterprise, as follows:

the first step: establishing a stop word corpus, and removing stop words from the segmented text, wherein the stop word corpus contains punctuation marks, common words, and words other than nouns, verbs, adjectives and adverbs, so as to obtain the words that are actually useful;

2. The method of claim 1, wherein word graph scanning is implemented based on the prefix dictionary D_f, generating a directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in the text of the enterprise research and development resource information text corpus T, and the DAG is generated by the following steps:

b) if the word frequency P of the segment L, which starts at position k and ends at a position i, is equal to 0, the prefix exists in the prefix dictionary D_f but the word does not exist in the statistical dictionary, and the loop continues;

b) position i is increased by 1 to form a new segment L;

4) a directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in the input text of the corpus T is generated.