CN113515585A

CN113515585A - Construction method, retrieval method and system of special lexicon in dangerous chemical safety field

Info

Publication number: CN113515585A
Application number: CN202010281569.5A
Authority: CN
Inventors: 蒋瀚; 于一帆; 郭峻东; 施红勋; 常庆涛
Original assignee: China Petroleum and Chemical Corp; Sinopec Qingdao Safety Engineering Institute
Current assignee: China Petroleum and Chemical Corp; Sinopec Qingdao Safety Engineering Institute
Priority date: 2020-04-10
Filing date: 2020-04-10
Publication date: 2021-10-19

Abstract

The invention provides a construction method, a retrieval method and a system for a professional lexicon in the field of dangerous chemical safety, and belongs to the technical field of chemical safety. The method comprises the following steps: forming an entry storage structure, acquiring vocabulary related to the dangerous chemical safety specialty, and forming the vocabulary into entries of a professional lexicon according to the entry storage structure, wherein the entry storage structure is the storage structure of the data units of the professional lexicon, the entries are defined as the data units of the professional lexicon, and each data unit comprises the character string of a word; and forming word lists of the professional lexicon, then constructing an identification tree over the word lists and the vocabulary, obtaining through the identification tree an index value corresponding to the character string of each word, and recording the mapping relation between the index value and the character string or the entry of the word in the professional lexicon. The invention is used for professional vocabulary processing.

Description

Construction method, retrieval method and system of special lexicon in dangerous chemical safety field

Technical Field

The invention relates to the technical field of chemical safety, in particular to a construction method of a professional lexicon, a retrieval method of the professional lexicon, a system for indexing professional vocabularies, equipment for indexing the professional vocabularies and a computer-readable storage medium.

Background

With the development of the chemical industry and the supervision of dangerous chemicals, a large number of laws and regulations, systems, archives and report files are accumulated in the process of the safety supervision of dangerous chemicals and public internet services of related organizations and enterprises, and unstructured data are gradually accumulated and developed into big data. The value of big data is reflected in the aspect of intelligent processing of large-scale data sets, and how to effectively utilize the massive information in the big data becomes a key problem for the development of information technology.

Natural language processing is an important direction in the fields of computer science and artificial intelligence. Natural language processing has become the focus of current attention as cognitive intelligence in the field of artificial intelligence. The text data is processed and analyzed based on the natural language processing technology, so that a large amount of labor cost for inquiring data, arranging documents and extracting information can be saved, the document application can be better carried out, and the documents can be deeply mined and utilized. And the construction of the word segmentation library is the basis of all the natural language processing technologies. The construction of the word segmentation library facing the professional field is important for the natural language processing of the documents in the professional field.

At present, research on the construction of professional word segmentation libraries exists in fields such as medicine and electric power, but no such research exists for the dangerous chemical safety field, which to some extent restricts the application of natural language processing technology in that field. Moreover, technical terms have not been manually standardized: different terms referring to the same thing are in use, and different abbreviations of the same term still need to be unified, which indicates a lack of norms for recording technical terms or technical vocabulary.

The bottleneck of applying natural language processing technology and full text retrieval technology to the safety field of dangerous chemicals is the deficiency of a professional lexicon. The invention constructs a special word bank for dangerous chemicals and can fundamentally solve the problem, thereby implementing various applications oriented to text data.

Disclosure of Invention

The invention aims to provide a construction method, a retrieval method and a system of a professional lexicon in the safety field of dangerous chemicals, which solve the technical problem that natural language processing, full-text retrieval and the like are difficult to perform in the safety field of dangerous chemicals because a corpus of professional terms or vocabulary is missing, the corpus lacks standardization, and general-purpose lexicons are poorly applicable.

In order to achieve the above object, an embodiment of the present invention provides a method for constructing a professional lexicon, where the method includes:

S1) forming an entry storage structure, acquiring vocabulary related to the dangerous chemical safety specialty, and forming the vocabulary into entries of a professional lexicon according to the entry storage structure, wherein the entry storage structure is the storage structure of the data units of the professional lexicon, the entries are defined as the data units of the professional lexicon, and each data unit comprises the character string of a word;

S2) forming word lists of the professional lexicon, then constructing an identification tree over the word lists and the vocabulary, obtaining through the identification tree an index value corresponding to the character string of each word, and recording the mapping relation between the index value and the character string or the entry of the word in the professional lexicon, thereby completing the construction of the professional lexicon, wherein a word list is defined as a set of entries of the same type in the professional lexicon.

Specifically, the forming of the entry storage structure in step S1) includes:

defining a professional term item, a synonym item, a part-of-speech item and a word frequency item, and forming the entry storage structure from these four fields, wherein

the professional term item is used for recording the character string of a professional word, the synonym item is used for recording the character strings of synonyms of the professional word, the part-of-speech item is used for recording the word classification of the professional word, the word frequency item is used for recording the number of occurrences of the professional word, and the professional word is at least one word in the acquired vocabulary.

Specifically, the vocabulary about the safety specialization of the dangerous chemical obtained in the step S1) includes:

using a word segmentation device to segment the document data related to the safety major of the dangerous chemicals, and obtaining at least words, word classifications of the words and the occurrence times of the words after word segmentation;

and screening the vocabulary, and obtaining the vocabulary related to the safety major of the dangerous chemicals after screening and using the vocabulary as the obtained vocabulary.

Specifically, before using the tokenizer to tokenize the document data about the hazardous chemical safety specialty in step S1), the method further includes:

acquiring text data, wherein the text data comprises: text data with dangerous chemical accident detail information, text data with dangerous chemical enterprise registration information, text data with dangerous and operability research report information, text data with safety instrument system grading and verification research report information, text data with legal and regulatory information related to dangerous chemical safety and text data with national chemical standard information;

converting the text data into a text document in the same data format;

and removing the text documents with the text length smaller than the length threshold value, and then performing document processing operation on the reserved text documents to obtain the document data related to the dangerous chemical safety specialties, wherein the document processing operation comprises the steps of removing typesetting format information, removing punctuation marks and removing stop words.

Specifically, in step S1), forming the vocabulary into entries of a professional lexicon according to the entry storage structure, including:

extracting vocabularies from the acquired vocabularies, and combining the extracted vocabularies to obtain professional vocabularies, or extracting the vocabularies from the acquired vocabularies and taking the extracted vocabularies as the professional vocabularies;

according to the different fields of the entry storage structure, recording the character strings of the professional words in the professional term item, recording the word classifications corresponding to the professional words in the part-of-speech item, and recording the numbers of occurrences corresponding to the professional words in the word frequency item, to form entries of the professional lexicon.

Specifically, the step S1) extracts words from the acquired words, and combines the extracted words to obtain a professional word, including:

extracting any two sequentially adjacent words from the acquired vocabulary, wherein each of the two words is marked with a noun identifier or a verb identifier, and the identifiers of the two words may occur in any ordered combination, namely noun identifier followed by noun identifier, noun identifier followed by verb identifier, verb identifier followed by noun identifier, or verb identifier followed by verb identifier, the order being the semantic order of the two words when extracted;

and keeping the semantic order of the two words as extracted, combining the two words into a candidate professional word, and obtaining the professional word after manual review of the candidate professional word, taking cost and other factors into account.

recording the character strings of the vocabulary in the professional term item, recording the word classifications corresponding to the vocabulary in the part-of-speech item, and initializing the numbers of occurrences corresponding to the vocabulary to a default value and recording them in the word frequency item, to form entries of the professional lexicon, wherein the vocabulary is obtained from structured form data.

Specifically, the vocabulary in step S2) includes:

a fixed equipment name word list, a department and/or office full-name and/or abbreviation word list, a chemical enterprise plate word list, a staff public work word list, a staff name word list, an administrative district word list, an organization full-name and/or abbreviation word list, an accident type word list, a refining device category word list, a ten-major-risk word list, a physical medium word list, an enterprise full-name and/or abbreviation word list, a revenue creating means word list, an honor name word list, an accident briefing type word list, a law and regulation type word list, a dangerous chemical directory word list, a risk point name word list, a processing unit name word list, an accident death reason word list and a risk area word list.

Specifically, the step S2) of constructing the identifier tree about the vocabulary and the vocabulary includes:

assigning a first identifier corresponding to the word list, assigning a second identifier corresponding to the word count of the vocabulary, and assigning a third identifier corresponding to the character string of the vocabulary, wherein the second identifier comprises the first identifier and a first characteristic, and the third identifier comprises the second identifier and a second characteristic;

according to a first classification of the vocabulary formed by the word list and a second classification formed by the word count of the vocabulary within the first classification, using the first identifier as the identifier of a first-layer node, using the second identifier as the identifier of a second-layer node under the first-layer node, using the third identifier as the identifier of a leaf node under the second-layer node, and configuring an identifier of a root node, to obtain the identification tree.

Specifically, after the constructing the identifier tree about the vocabulary and the vocabulary in step S2), and before obtaining the index value corresponding to the character string of the vocabulary through the identifier tree, the method further includes:

determining a current vocabulary and a character string of the current vocabulary according to the identifier in the third identifier corresponding to the identifier of the current leaf node, wherein the current leaf node is any one of the leaf nodes;

sorting identifiers of all nodes except the identifier of the root node in the identification tree, and forming a feature vector corresponding to the character string of the current vocabulary after sorting, wherein the columns of the feature vector are defined as the identifiers of the nodes in the identification tree, and each column has an identification value;

and inquiring identifiers of nodes on a path from the identifiers of the current leaf nodes to the identifiers of the root nodes, changing the identification values of the columns in the characteristic vectors corresponding to the identifiers of the nodes on the path into first characteristic numerical values, and then changing the identification values of the columns which are not changed into second characteristic numerical values to obtain changed characteristic vectors.
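The feature vector step above can be sketched as follows. This is only an illustration under stated assumptions: the first and second characteristic numerical values are taken to be 1 and 0 (the text does not fix them), and the small tree in `PARENT`, its identifiers and the function name `feature_vector` are hypothetical.

```python
from typing import Dict, List

# Hypothetical identification tree: child identifier -> parent identifier ("A" is the root).
PARENT: Dict[str, str] = {
    "A1": "A", "A101": "A1", "A10101": "A101", "A10102": "A101",
    "A2": "A", "A201": "A2", "A20101": "A201",
}
# Columns of the feature vector: every identifier except the root, in sorted order.
COLUMNS: List[str] = sorted(PARENT)


def feature_vector(leaf: str, first_value: int = 1, second_value: int = 0) -> List[int]:
    """Mark the columns on the leaf-to-root path with first_value, all others with second_value."""
    on_path = set()
    node = leaf
    while node in PARENT:  # walk upwards; the root itself has no column
        on_path.add(node)
        node = PARENT[node]
    return [first_value if col in on_path else second_value for col in COLUMNS]


print(COLUMNS)
print(feature_vector("A10102"))  # path A10102 -> A101 -> A1
```

Stacking the vectors of all leaf nodes row by row would give the feature vector matrix used in the next step.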

Specifically, the obtaining of the index value corresponding to the character string of the vocabulary through the identification tree in step S2) includes:

recording the obtained modified feature vectors, and forming a feature vector matrix from all the modified feature vectors;

and according to the relative position, in the feature vector matrix, of the modified feature vector corresponding to the character string of the vocabulary and the relative position of the first characteristic numerical value, performing compression calculation on that modified feature vector to obtain the index value corresponding to the character string of the vocabulary.

Specifically, in step S2), performing compression calculation on the modified feature vector corresponding to the character string of the vocabulary, where the compression calculation includes:

calculating the numerical sum $F(n_i)$:

$$F(n_i) = \sum_{j=1}^{n_i} m_j$$

wherein the numerical sum $F(n_i)$ is defined as the sum of the identification values of the column vectors from the 1st column to the $n_i$-th column of the current feature vector, $m_j$ is the column vector of the $j$-th column of the current feature vector, $n_i$ is the position number of a column vector in the current feature vector, and the current feature vector is any one feature vector in the feature vector matrix;

determining that, when the numerical sum $F(n_i)$ is greater than or equal to $i$, the position number $n_i$ is mapped to the $i$-th positioning value $N_i$ of the index value $I_W$, wherein the index value $I_W$ is defined as $(N_1, N_2, N_3)$.
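A minimal sketch of this compression calculation follows, reading F(n) as the running total of the identification values in columns 1 to n. It assumes the identification values are 1 on the path and 0 elsewhere, and that N_i is taken as the first column position at which the running total reaches i; both are interpretations, and the function name `compress` is hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple


def compress(vector: List[int], k: int = 3) -> Tuple[int, ...]:
    """Return (N_1, ..., N_k), where N_i is the first 1-based column n with F(n) >= i."""
    prefix = list(accumulate(vector))  # prefix[n - 1] equals F(n) = m_1 + ... + m_n
    index = []
    for i in range(1, k + 1):
        # Raises StopIteration if the vector holds fewer than k marked columns.
        n_i = next(n for n, f in enumerate(prefix, start=1) if f >= i)
        index.append(n_i)
    return tuple(index)


print(compress([1, 1, 0, 1, 0, 0, 0]))
print(compress([0, 0, 0, 0, 1, 1, 1]))
```

Under these assumptions a sparse vector with one marked column per tree level is reduced to three positions, one for the first-layer node, one for the second-layer node and one for the leaf.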

The embodiment of the invention provides a search method of a professional lexicon, which comprises the following steps:

acquiring the character string of a text to be recognized, and querying, through the mapping relation, the index value in the professional lexicon that matches the character string of the text to be recognized;

and using the queried index value as a word vector of the text to be recognized, and returning the word vector to the request end, or,

and returning the entries or vocabularies corresponding to the inquired index values in the professional word stock to the request end.
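The two return paths of the retrieval method can be illustrated with a pair of lookup tables. This is a sketch only: the mapping contents, the index values and the function names `lookup_vector` and `lookup_word` are made up for illustration and are not taken from the lexicon itself.

```python
from typing import Dict, Optional, Tuple

IndexValue = Tuple[int, int, int]

# Hypothetical mapping recorded at construction time: character string -> index value.
STRING_TO_INDEX: Dict[str, IndexValue] = {"leaking": (1, 2, 3), "explosion": (1, 4, 5)}
# Reverse mapping, used when the word or entry itself is to be returned.
INDEX_TO_STRING: Dict[IndexValue, str] = {v: k for k, v in STRING_TO_INDEX.items()}


def lookup_vector(text: str) -> Optional[IndexValue]:
    """Return the index value to use as the word vector, or None if the string is unknown."""
    return STRING_TO_INDEX.get(text)


def lookup_word(index: IndexValue) -> Optional[str]:
    """Return the word recorded for an index value, or None if there is none."""
    return INDEX_TO_STRING.get(index)


print(lookup_vector("leaking"))
print(lookup_word((1, 4, 5)))
```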

The embodiment of the invention provides a system for professional vocabulary indexing, which comprises:

a professional lexicon provided with an entry storage structure and used for acquiring vocabulary related to the dangerous chemical safety specialty and then forming the vocabulary into entries of the professional lexicon according to the entry storage structure, wherein the entry storage structure is the storage structure of the data units of the professional lexicon, the entries are defined as the data units of the professional lexicon, and each data unit comprises the character string of a word;

the professional lexicon is also provided with a word list and an identification tree related to the word list and the vocabulary, and is used for obtaining an index value corresponding to the character string of the vocabulary through the identification tree and recording the mapping relation between the index value and the character string or the entry of the vocabulary in the professional lexicon.

In another aspect, an embodiment of the present invention provides an apparatus for professional vocabulary indexing, including:

at least one processor;

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implements the aforementioned method by executing the instructions stored by the memory.

In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions, which, when executed on a computer, cause the computer to perform the foregoing method.

The invention determines the range related to professional vocabularies in the word segmentation library, sets the classification method and the organization mode of the entries in the word segmentation library, forms the index mode of each word list and the entries in the word list, realizes the calling mechanism of the word lists in the word segmentation library, and can support the calling of different word lists aiming at texts of different articles and the like;

the invention can improve the value of basic data in a big data system and further support the application of the related natural language processing technology in the safety field of dangerous chemicals by establishing a word bank with perfect field coverage, reasonable classification mode, higher index efficiency and correct calling logic.

The invention constructs a dangerous chemical safety professional word bank (or professional word bank), can provide basic data for the deep utilization of document data, for example, aiming at the linguistic characteristics of laws and regulations, archives, documents, accident cases and news, different vocabulary entries are combined and natural language processing technology is combined, and the obtained statistical data is automatically fragmented and analyzed, so that the idle text data is utilized;

the professional word-dividing library can assist the construction of the knowledge graph, for example, all professional words in the word-dividing library are classified according to entity meanings, can be directly imported into a relational database on which the knowledge graph depends, and can be quickly organized and generated after relational information is added, so that the query focusing on relations is allowed;

the professional word segmentation library can enable full-text retrieval service aiming at the chemical safety field to be possible, for example, after text information is fragmented through the word segmentation library and the word segmentation algorithm, the query of the text is not limited to a plurality of key words representing the text but the information contained in the full text, and the full-text retrieval technology realized based on the word segmentation library can meet the requirements of quick retrieval and high-precision retrieval.

Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:

FIG. 1 is a schematic diagram of the main method steps of an embodiment of the present invention;

FIG. 2 is a schematic diagram of exemplary method steps of an embodiment of the present invention;

FIG. 3 is a diagram illustrating an organization structure of a professional lexicon according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating an identification tree of a thesaurus according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the feature vector M₁ of word a according to an embodiment of the present invention;

FIG. 6 is a partial diagram of a thesaurus identification tree according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating feature vectors of professional words according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.

Example 1

As shown in fig. 1, an embodiment of the present invention provides a method for constructing a professional lexicon, where the method includes:

S2) forming word lists of the professional lexicon, then constructing an identification tree over the word lists and the vocabulary, obtaining through the identification tree an index value corresponding to the character string of each word, and recording the mapping relation between the index value and the character string or the entry of the word in the professional lexicon to complete the construction of the professional lexicon, wherein a word list is defined as a set of entries of the same type in the professional lexicon.

In some implementations, the entry storage structure may be the storage structure of the minimum data unit in the professional lexicon, and an entry may be that minimum data unit; the vocabulary related to the dangerous chemical safety specialty may be obtained from data containing text, such as structured data, semi-structured data and text document data; the character string of a word may be the word itself, for example, a string composed of Chinese characters, or of alphabetic characters and Chinese characters.

The forming of the entry storage structure may include: defining a professional term item, a synonym item, a part-of-speech item and a word frequency item, and forming the entry storage structure from these four fields.

The specific form of the entry storage structure may be a list or vector expression, for example, a list expression is used, and a professional term, a synonym term, a part of speech term and a frequency term may be used as four fields of the entry, respectively, as shown in table 1.

Table 1 Entry storage structure

Professional term | Synonym | Part of speech | Word frequency

The word frequency item counts the occurrences of words recognized by natural language processing applications developed on the basis of the professional lexicon, and the part-of-speech item holds the part of speech assigned according to the characteristics of the professional word; when an entry is empty, each item may have no content or take a default empty identifier, and when an entry is non-empty, at least the professional term item should have data.
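As an illustration of the four-field entry, the structure could be expressed as a small record type. The class name `Entry`, the attribute names and the sample values below are assumptions made for this sketch, not part of the described lexicon.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    """One lexicon entry with the four fields of Table 1 (names are illustrative)."""
    term: Optional[str] = None                         # professional term: the word's character string
    synonyms: List[str] = field(default_factory=list)  # character strings of synonyms
    pos: Optional[str] = None                          # part of speech (word classification)
    frequency: int = 0                                 # number of occurrences

    def is_empty(self) -> bool:
        # An entry counts as non-empty only when the professional term holds data.
        return not self.term


empty = Entry()
# Sample values are made up for illustration.
filled = Entry(term="hydrogen sulfide", synonyms=["H2S"], pos="n", frequency=12)
print(empty.is_empty(), filled.is_empty())
```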

The obtaining of the vocabulary for safety specialties of hazardous chemicals may include: using a word segmentation device to segment the document data related to the safety major of the dangerous chemicals, and obtaining at least words, word classifications of the words and the occurrence times of the words after word segmentation;

A word segmenter (or word segmentation tool) may be selected that builds a directed acyclic graph from the input text and uses a dynamic programming algorithm to find the maximum-probability path in that graph; the screening may be carried out with the help of a regular function, for example, the regular function can call a general dictionary, such as an everyday-language dictionary, for comparison, so as to filter out words with too little relevance to the chemical safety field.
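The kind of segmenter described above can be sketched in a few lines. This is a toy illustration, not the tool actually used: the frequency dictionary `FREQ`, its counts and the function names are invented for the example, and unknown single characters are given an assumed count of 1.

```python
import math
from typing import Dict, List

# Toy frequency dictionary (made-up counts): 硫化氢 = hydrogen sulfide, 泄漏 = leak.
FREQ: Dict[str, int] = {"硫化氢": 50, "硫化": 5, "氢": 5, "泄漏": 40,
                        "硫": 1, "化": 1, "泄": 1, "漏": 1}


def build_dag(text: str, freq: Dict[str, int]) -> Dict[int, List[int]]:
    """For each start index, list the inclusive end indices of dictionary words."""
    dag = {}
    for start in range(len(text)):
        ends = [end for end in range(start, len(text)) if text[start:end + 1] in freq]
        dag[start] = ends or [start]  # fall back to a single character
    return dag


def segment(text: str, freq: Dict[str, int]) -> List[str]:
    """Pick the maximum-probability path through the DAG by dynamic programming."""
    total = sum(freq.values())
    dag = build_dag(text, freq)
    n = len(text)
    # best[i] = (best log-probability from i to the end, end index of the first word)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(freq.get(text[start:end + 1], 1) / total) + best[end + 1][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(text[i:end + 1])
        i = end + 1
    return words


print(segment("硫化氢泄漏", FREQ))
```

The table is filled from the end of the text backwards so that each position reuses the best score already computed for every later position.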

Before the using the tokenizer to tokenize the document data about the dangerous chemical safety specialty, the method may further include:

acquiring text data, wherein the text data comprises: text data with dangerous chemical accident detail information, text data with dangerous chemical enterprise registration information, text data with dangerous and operability research report information, text data with safety instrument system grading and verification research report information, text data with legal and regulatory information related to dangerous chemical safety and text data with national chemical standard information; converting the text data into a text document in the same data format; and removing the text documents with the text length smaller than the length threshold value, and then performing document processing operation on the reserved text documents to obtain the document data related to the dangerous chemical safety specialties, wherein the document processing operation comprises the steps of removing typesetting format information, removing punctuation marks and removing stop words.

The same data format may be the same encoding format, such as UTF-8 character encoding; the length threshold is not limited and can be selected according to the specific data format, for example, for UTF-8 character encoding, a length of 10, 30 or 50 characters can be selected; the text data may include text documents of various format types, such as PDF documents, JSON documents and DOC documents; typesetting format information includes, for example, delimiters, linefeeds and page breaks; punctuation marks include, for example, commas, periods and semicolons; and stop words include, for example, "o", "of" and "eight".
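The document processing operation might look like the following sketch. It is a simplification under stated assumptions: the threshold of 10 characters is one of the example values, the stop-word set is a placeholder, the input is assumed to be already extracted as bytes (PDF or DOC parsing is out of scope), and stop words are removed from whitespace-separated tokens, whereas for Chinese text this would normally happen after word segmentation.

```python
import re
import string
from typing import Iterable, List

LENGTH_THRESHOLD = 10            # assumed; the text suggests 10, 30 or 50 characters
STOP_WORDS = {"the", "of", "a"}  # placeholder stop-word list
LAYOUT = re.compile(r"[\r\n\f\t]+")  # linefeeds, page breaks, tab delimiters
PUNCT = re.compile("[" + re.escape(string.punctuation) + "，。；]")


def clean_documents(raw_docs: Iterable[bytes]) -> List[str]:
    """Unify encoding, drop short documents, strip layout, punctuation and stop words."""
    cleaned = []
    for raw in raw_docs:
        text = raw.decode("utf-8", errors="ignore")  # same data format: UTF-8 text
        if len(text) < LENGTH_THRESHOLD:             # remove too-short documents
            continue
        text = LAYOUT.sub(" ", text)
        text = PUNCT.sub(" ", text)
        tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
        cleaned.append(" ".join(tokens))
    return cleaned


docs = [b"short", "Tank area leak,\nthe cause of a fire; page\x0c2.".encode("utf-8")]
print(clean_documents(docs))
```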

The forming the vocabulary into entries of a professional lexicon according to the entry storage structure may include:

For the extraction manner, for example, all the currently acquired words may be ranked by character length, and selectively extracted by the part of speech identifier of the word, for example, a noun word below a word length threshold (for example, two characters) may be extracted.

For the combination mode, extracting vocabularies from the obtained vocabularies, and combining the extracted vocabularies to obtain professional vocabularies may include:

maintaining the semantic order of the two words as extracted, and combining the two words into a candidate professional word, wherein the candidate professional word can be manually reviewed to obtain a professional word;

in some cases, a trained deep learning model may instead be used, either to generate the candidate professional word corresponding to the two words input into the model, or to identify from the context of the two words whether they form a candidate professional word.
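The rule-based combination of adjacent words can be sketched as follows. The tag names `n` and `v`, the function name `candidate_terms` and the sample tagged sentence are assumptions for illustration; the manual review or deep learning step that follows is not modelled.

```python
from typing import List, Tuple

ALLOWED_TAGS = {"n", "v"}  # noun and verb identifiers (tag names are assumed)


def candidate_terms(tagged: List[Tuple[str, str]], sep: str = "") -> List[str]:
    """Join each pair of adjacent words whose tags are both noun or verb, keeping their order."""
    candidates = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 in ALLOWED_TAGS and t2 in ALLOWED_TAGS:
            candidates.append(w1 + sep + w2)
    return candidates


# Hypothetical segmenter output: (word, part-of-speech tag) in text order.
tagged = [("delayed", "v"), ("coking", "n"), ("unit", "n"), ("quickly", "d"), ("leaks", "v")]
print(candidate_terms(tagged, sep=" "))
```

For Chinese text the separator would be left empty so that the two words are concatenated directly.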

The structured form data can be structured data such as a risk classification database, a safety protection guide database and a dangerous chemical substance catalog; such structured data can be recorded directly into the professional lexicon, complementing the word segmentation tool's segmentation of document data and thereby improving the efficiency with which the professional lexicon is populated.
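Recording structured data directly could be sketched like this. The CSV layout, the column names `name` and `alias`, the default frequency of 0 and the noun tag are all assumptions for the example; a real catalogue would have its own schema.

```python
import csv
import io
from typing import Dict, List

DEFAULT_FREQUENCY = 0  # assumed default; no value is given in the text

# Hypothetical two-row extract of a dangerous chemical catalogue.
CATALOGUE_CSV = """name,alias
styrene,vinylbenzene
propylene,propene
"""


def entries_from_catalogue(csv_text: str) -> List[Dict[str, object]]:
    """Turn each catalogue row into an entry with the four fields, skipping segmentation."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entries.append({
            "term": row["name"],
            "synonyms": [row["alias"]] if row["alias"] else [],
            "pos": "n",                     # catalogue names treated as nouns (assumption)
            "frequency": DEFAULT_FREQUENCY,
        })
    return entries


for entry in entries_from_catalogue(CATALOGUE_CSV):
    print(entry)
```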

The formed word lists can comprise: a fixed equipment name word list, a department and/or office full-name and/or abbreviation word list, a chemical enterprise plate word list, a staff public work word list, a staff name word list, an administrative district word list, an organization full-name and/or abbreviation word list, an accident type word list, a refining device category word list, a ten-major-risk word list, a physical medium word list, an enterprise full-name and/or abbreviation word list, a revenue creating means word list, an honor name word list, an accident briefing type word list, a law and regulation type word list, a dangerous chemical directory word list, a risk point name word list, a processing unit name word list, an accident death reason word list and a risk area word list.

A fixed equipment name vocabulary, such as aniline plant, benzene extraction plant, phenol-acetone plant, phthalic anhydride plant, closed discharge tank, and polypropylene plant;

chemical enterprise plate vocabularies, such as refinery plates, stone plates, sales plates, oil field plates and professional company plates;

staff vocabularies such as safety management staff, safety officers, pumpers, gas producers, oil producers, operators, captain with shift, and locomotive deputys;

ten major risk vocabularies, such as personnel concentration sites in a plant area, pipeline laying, ocean platforms, oil geophysical prospecting, hazardous chemical tank areas, oil and gas wells, oil and gas testing and downhole operations;

a list of physical media such as styrene, propylene, residual oil, alcohol-containing wastewater, hydrogen sulfide-containing lean diethanolamine, hydrogen sulfide-containing acidic water, and mixed xylenes;

a revenue creating means word list, such as safety standardized consultation guidance, safety simulation training, safety evaluation, safety intelligent monitoring, corrosion evaluation and process safety and reaction evaluation;

an honor name word list, such as a technical invention prize, a scientific and technological achievement prize, a scientific and technological progress prize and a scientific and technological innovation prize;

accident-briefing type vocabularies, such as overhead falling, DCS malfunction, security, explosion, unplanned outages, traffic accidents, lifting, collapsing, leaking, catching fire, and natural disasters;

and a list of names of treatment units, such as # 1 hydrocracking unit, # 1 gas holder, # 1 gas desulfurization unit, # 1 sour water stripping unit, # 2 sewage treatment unit, # 2 adsorption separation unit and # 2 delayed coking unit.

Constructing the identification tree about the word list and the vocabulary may include:

In some implementations, the first and second signatures may be sequential letters or numbers, or consist of letters and numbers in a predefined order. In some cases the identifier of the root node may be ignored, but for uniformity the identifier of the root node may be determined first and the first identifiers then assigned on the basis of it. For example, if the identifier of the root node is determined to be A, then in the first-level nodes of the identification tree (the root node, the first-level nodes, the second-level nodes and the leaf nodes are nodes of the identification tree under different categories, each node corresponding to an identifier) the first identifiers are A₁, A₂, A₃, …, A_m. Further, due to the constraints of the first class and the second class, within the second-level nodes of the identification tree, taking A₁ and A_m among the first identifiers as examples, the second identifiers corresponding to A₁ are A₁01, A₁02, A₁03, …, A₁0n, where 0n may be a first signature, and the second identifiers corresponding to A_m are A_m01, A_m02, A_m03, …, A_m0n. Further, taking A₁01 and A₁0n among the second identifiers as examples, the third identifiers corresponding to A₁01 are A₁0101, A₁0102, A₁0103, …, A₁010p, and the third identifiers corresponding to A₁0n are A₁0n01, A₁0n02, A₁0n03, …, A₁0n0p, where 0p may be a second signature and m, n and p are all positive integers.
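The identifier scheme described above can be illustrated with a minimal sketch. The function name `build_identifiers`, the two-digit zero-padded signatures, and the assumption that every first-level node has the same number n of children (and every second-level node the same number p) are illustrative simplifications, not features stated in the source.

```python
def build_identifiers(root, m, n, p):
    """Return first, second and third identifiers for a tree with m first-level
    nodes, n second-level nodes per first-level node and p leaves per
    second-level node (hypothetical uniform shape)."""
    def signature(k):
        # Assumed two-digit signature such as "01", "02", ... (valid for k <= 99)
        return f"{k:02d}"

    first = [f"{root}{i}" for i in range(1, m + 1)]
    second = {a: [a + signature(j) for j in range(1, n + 1)] for a in first}
    third = {
        b: [b + signature(k) for k in range(1, p + 1)]
        for children in second.values()
        for b in children
    }
    return first, second, third


first, second, third = build_identifiers("A", m=2, n=3, p=2)
print(first)           # ['A1', 'A2']
print(second["A1"])    # ['A101', 'A102', 'A103']
print(third["A101"])   # ['A10101', 'A10102']
```

With this plain string concatenation, identifiers stay unambiguous only while m is at most 9; a larger tree would need a fixed-width or delimited first identifier.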

Index values can be realized by constructing a feature vector and a feature matrix; after the building of the identification tree about the vocabulary and before the obtaining of the index value corresponding to the character string of the vocabulary through the identification tree, the method further comprises:

inquiring identifiers of nodes on a path from the identifier of the current leaf node to the identifier of the root node, changing the identification values of columns in the characteristic vectors corresponding to the identifiers of the nodes on the path into first characteristic numerical values, and then changing the identification values of columns which are not changed into second characteristic numerical values to obtain changed characteristic vectors;

the above steps can be regarded as one pass of a loop, and the loop is complete once a corresponding feature vector has been constructed for every vocabulary; for convenience of calculation, the columns may be ordered as follows: the identifier of a given word list (one selected first identifier), the identifiers of the second-level nodes under that word list (the second identifiers corresponding to the selected first identifier), and the identifiers of the vocabularies under each of those second-level nodes (the third identifiers corresponding to each of those second identifiers) are arranged in sequence to form a small group of column vectors; one such group is formed for each word list, and the groups are arranged in the existing order of the word lists to form the feature vector corresponding to the current vocabulary; in the feature vectors ordered in this way, the relative position of each column vector can be described by its digit number;

in the feature vector, the identification value may initially default to a custom value or a random value; for simplicity of calculation, the first feature value may be 1 and the second feature value may be 0; a changed identification value differs only numerically (it becomes the first or the second feature value) and can still be called an identification value; because the third identifier is directly associated with the first identifier and the second identifier, a backtracking path through the nodes of the different levels of the identification tree can be found from the symbolic features of the third identifier; at each level the identifier of only one node is backtracked, and backtracking then continues from that node to the next level toward the root; for example, for a certain vocabulary and its feature vector, suppose the third identifier of the leaf node NODE1 in the identification tree is A₁0103; from the first signature 01 and the second signature 03, the identifier of the second-level node NODE2 on the path is A₁01 (one of the second identifiers), and the identifier of the first-level node NODE3 on the path is A₁ (one of the first identifiers); the nodes on the path are NODE1 to NODE3, and their identification values may be changed to 1, that is, in the feature vector corresponding to the certain vocabulary, the identification values of the column vectors corresponding to the identifiers of the nodes on the path are changed to 1 while the identification values of the remaining column vectors are changed to 0, which completes the construction of the feature vector, with changed identification values, for the certain vocabulary.
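The backtracking step can be sketched as follows. The sketch assumes identifiers are plain strings with two-character signatures (so A₁0103 is written "A10103"), that the root node is ignored, and that the column order is given as a list; the function names are hypothetical.

```python
def path_to_root(leaf_id, signature_length=2):
    """Backtrack from a third identifier to its second and first identifiers."""
    second_id = leaf_id[:-signature_length]    # drop the second signature, e.g. "03"
    first_id = second_id[:-signature_length]   # drop the first signature, e.g. "01"
    return [leaf_id, second_id, first_id]


def feature_vector(leaf_id, columns, first_value=1, second_value=0):
    """Set columns on the leaf-to-root path to first_value and all others to second_value."""
    on_path = set(path_to_root(leaf_id))
    return [first_value if column in on_path else second_value for column in columns]


# Hypothetical column order: word list, then each second-level node followed by its leaves
columns = ["A1", "A101", "A10101", "A10102", "A10103", "A102", "A10201"]
print(path_to_root("A10103"))             # ['A10103', 'A101', 'A1']
print(feature_vector("A10103", columns))  # [1, 1, 0, 0, 1, 0, 0]
```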

The obtaining, through the identification tree, an index value corresponding to a character string of the vocabulary may include:

The relative positions in the feature vector matrix can serve as the basis for composing the index values, and the relative position of the first feature value can serve as the basis for composing the positioning values within an index value; for example, if in a given feature vector the first small group of column vectors occupies 5 digits, then the first positioning values of the index values of all vocabularies in the second small group of column vectors may be 6, and the remaining positioning values may be associated with the relative positions, within the group, of the column vectors holding the first feature value.

Performing a compression calculation on the modified feature vectors corresponding to the character strings of the vocabulary, wherein the compression calculation may include:

calculating the numerical sum F(n_i):

each time the digit number n_i is assigned a specific digit number, the numerical sum F(n_i) is obtained and compared with i; when F(n_i) is greater than or equal to i, the currently assigned digit number is mapped to the i-th positioning value N_i of the index value I_W; the digit number may then be incremented by 1, and the above calculation, comparison and determination repeated for (specific digit number + 1), until all digit numbers have been traversed or the third positioning value of an index value has been mapped, at which point the loop exits and ends;

mapping the digit number n_i to the i-th positioning value N_i of the index value I_W may, for example, mean that the digit number n_i is equal to the i-th positioning value N_i, or that the digit number n_i is used as the variable of a linear function whose value forms the i-th positioning value N_i.
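A minimal sketch of this scan is given below. It assumes that F(n) is the sum of the first n identification values of the feature vector, that the feature values are 1 and 0, and that the identity mapping N_i = n_i is used; these are assumptions made for illustration, since the expression for F(n_i) is not given in the text here. The function name `compress` is hypothetical.

```python
def compress(vector, size=3):
    """Scan digit numbers from the left and record the digit number at which
    the running sum F(n) first reaches i, for i = 1..size."""
    index_value = []
    running_sum = 0
    i = 1
    for n, value in enumerate(vector, start=1):   # n is the 1-based digit number
        running_sum += value                      # assumed F(n): sum of the first n values
        if running_sum >= i:
            index_value.append(n)                 # identity mapping: N_i = n
            i += 1
        if i > size:
            break                                 # third positioning value has been mapped
    return tuple(index_value)


print(compress([1, 1, 0, 0, 1, 0, 0]))  # (1, 2, 5)
```

Under these assumptions the index value is simply the list of positions of the non-zero entries, which is far shorter than the feature vector itself.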

Example 2

Based on the embodiment 1, the embodiment of the invention provides a search method of a professional lexicon, which comprises the following steps:

The request end may be an application service that uses the professional lexicon, such as a full-text search engine or a word segmentation model in natural language processing; the character strings of the recognized text can be obtained by the full-text search engine or the word segmentation model, so the professional lexicon can be used to train the word segmentation model and to provide professional vocabulary support for the full-text search engine;

because the index value has a mapping relation with the character string or the entry, the query can be made in either direction; for example, the mapping relation may be a table correspondence, or a relation of mutual query through a function; the function may convert the index value into a feature vector and then query the character string or the entry through the feature vector; the conversion may use the index value as the digit numbers of the non-zero column vectors of the feature vector, locate the feature vector in the feature vector matrix by those digit numbers, and thereby obtain the feature vector corresponding to the index value together with the entry and the character string of the vocabulary associated with that feature vector.
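The table-correspondence variant of the mapping relation can be sketched with two dictionaries. The class name `Lexicon`, its method names and the entry layout are hypothetical; the index value (6, 8, 9) for "explosion" is the one listed in Example 6.

```python
class Lexicon:
    """Two lookup tables: character string -> index value, index value -> entry."""

    def __init__(self):
        self._index_by_string = {}
        self._entry_by_index = {}

    def add(self, string, index_value, entry):
        self._index_by_string[string] = index_value
        self._entry_by_index[index_value] = entry

    def lookup_index(self, string):
        # Returns None when the recognized text is not in the lexicon
        return self._index_by_string.get(string)

    def lookup_entry(self, index_value):
        return self._entry_by_index.get(index_value)


lexicon = Lexicon()
lexicon.add("explosion", (6, 8, 9), {"term": "explosion", "pos": "n", "frequency": 1})

index_value = lexicon.lookup_index("explosion")   # may be returned as the word vector
entry = lexicon.lookup_entry(index_value)         # or the entry may be returned instead
print(index_value, entry["term"])                 # (6, 8, 9) explosion
```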

Example 3

Based on embodiment 1, an embodiment of the present invention provides a system for professional vocabulary indexing, where the system includes:

In some implementations, the system may also include an application service, such as an ElasticSearch engine or a Solr search engine.

In some implementations, the entry storage structure in the professional lexicon can be configured to have four fields: a professional vocabulary item, a synonym item, a part-of-speech item and a word-frequency item, wherein,

In some implementations, the system can further include a tokenizer, which can be configured to tokenize document data relating to a hazardous chemical safety specialty, and obtain at least a vocabulary, a word classification of the vocabulary, and a number of occurrences of the vocabulary after the tokenization;

the system may further include a filter that may be configured to select, from the segmented vocabularies, the vocabularies relating to the hazardous chemical safety specialty as the acquired vocabularies; the filter may be built from functions such as a character selection function and a character judgment function.

In some implementations, the system can further include a preprocessor; the preprocessor may be configured to obtain text data, convert the text data into text documents in the same data format, remove text documents whose text length is smaller than a length threshold, and then perform document processing operations on the retained text documents to obtain the document data about the hazardous chemical safety specialty, wherein the document processing operations comprise removing typesetting format information, removing punctuation marks and removing stop words.
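The document processing operations of the preprocessor can be sketched as below. The sketch assumes the documents have already been converted to plain strings, treats markup-style tags as the typesetting format information, and uses whitespace tokenization on English text purely for illustration; the stop-word set and the length threshold are hypothetical values.

```python
import re

STOP_WORDS = {"the", "of", "and", "a"}   # hypothetical stop-word list
MIN_LENGTH = 20                          # hypothetical length threshold, in characters


def preprocess(documents, min_length=MIN_LENGTH, stop_words=STOP_WORDS):
    """Drop short documents, then strip layout tags, punctuation and stop words."""
    cleaned = []
    for text in documents:
        if len(text) < min_length:
            continue                                # remove documents below the threshold
        text = re.sub(r"<[^>]+>", " ", text)        # remove markup-style format information
        text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation marks
        tokens = [t for t in text.split() if t.lower() not in stop_words]
        cleaned.append(" ".join(tokens))
    return cleaned


documents = ["too short", "<p>The leak of hydrogen sulfide, and a fire!</p>"]
print(preprocess(documents))  # ['leak hydrogen sulfide fire']
```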

In some implementations, the system can further include an extractor that can be configured to extract vocabularies from the acquired vocabularies and combine them to obtain the professional vocabularies, or to use the combined vocabularies directly as the professional vocabularies,

and, according to the different fields of the entry storage structure, to record the character strings of the professional vocabularies in the professional vocabulary item, record the word classification corresponding to the professional vocabularies in the part-of-speech item, and record the number of occurrences corresponding to the professional vocabularies in the word-frequency item, thereby forming entries of the professional lexicon; the extractor may be built from functions such as a character selection function and a character judgment function; in some cases these functions may be non-linear, and the extractor may, for example, be built from a trained deep learning model.

The extractor is specifically configured to extract any two sequentially adjacent vocabularies from the acquired vocabularies, where each of the two vocabularies is marked with a noun identifier or a verb identifier, and the identifiers of the two vocabularies may occur in any order, that is, noun followed by noun, noun followed by verb, verb followed by noun, or verb followed by verb; the order of the combination is the semantic order in which the two vocabularies were extracted;

keeping the semantic sequence of any two vocabularies when being extracted, and combining the two vocabularies to be used as a candidate professional vocabulary;

in some cases, the operation steps performed by the extractor may be performed directly by a human without using the extractor, for example, the professional vocabulary may be obtained after the candidate professional vocabulary is interpreted manually.
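The pairing rule of the extractor can be illustrated with a short sketch. It assumes the segmented text is available as (word, tag) pairs, that only the tags "n" and "v" count as noun and verb identifiers, and that a space is used as the joiner for English glosses; the function name is hypothetical.

```python
def candidate_terms(tagged_words, joiner=" "):
    """Combine each pair of adjacent words tagged noun or verb, keeping their order."""
    candidates = []
    for (word1, tag1), (word2, tag2) in zip(tagged_words, tagged_words[1:]):
        if tag1 in {"n", "v"} and tag2 in {"n", "v"}:
            candidates.append(word1 + joiner + word2)
    return candidates


tagged = [("arene", "n"), ("union", "v"), ("device", "n"), ("of", "uj"), ("bearing", "n")]
print(candidate_terms(tagged))  # ['arene union', 'union device']
```

The candidates produced this way would still need to be screened, manually or by a model, before being accepted as professional vocabularies.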

In some implementations, the professional lexicon is specifically configured to record the character string of the vocabulary in the professional vocabulary item, add the word classification corresponding to the vocabulary to the part-of-speech item, initialize the number of occurrences corresponding to the vocabulary with a default value, and record the initialized number of occurrences in the word-frequency item to form an entry of the professional lexicon, where the vocabulary is obtained from structured form data.
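A minimal sketch of an entry with the four fields, and of forming an entry from structured form data, might look as follows. The class name `Entry`, the field names and the default frequency of 1 are assumptions made for illustration.

```python
from dataclasses import dataclass, field

DEFAULT_FREQUENCY = 1   # hypothetical default used to initialize the word-frequency item


@dataclass
class Entry:
    term: str                                      # professional vocabulary item (character string)
    synonyms: list = field(default_factory=list)   # synonym item
    pos: str = "n"                                 # part-of-speech item
    frequency: int = 0                             # word-frequency item


def entry_from_structured(term, pos, default_frequency=DEFAULT_FREQUENCY):
    """Form an entry for a vocabulary taken directly from structured form data."""
    return Entry(term=term, pos=pos, frequency=default_frequency)


entry = entry_from_structured("naphtha", "nc")
print(entry)
```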

In some implementations, the word lists in the professional lexicon include:

In some implementations, the specialized thesaurus also has the identification tree constructed in example 1.

In some implementations, the professional lexicon further has the feature vector and the character string of the vocabulary constructed in example 1, and the entry or the character string of the vocabulary corresponding to the feature vector is recorded by the index value corresponding to the feature vector.

In some implementations, the specialized thesaurus is configured with a compression calculation rule for obtaining an index value by a feature vector;

wherein the compression calculation rule comprises:

calculating the numerical sum F(n_i):

Example 4

Based on embodiment 1 and embodiment 2, an embodiment of the present invention provides a search system for a professional lexicon, including:

a retrieval engine configured to obtain a character string of a recognition text, and query an index value matched with the character string of the recognition text in the professional lexicon in embodiment 1 through the mapping relationship in embodiment 1;

the retrieval engine is configured to use the queried index value as a word vector of the recognized text and return the word vector to a request end, or return a vocabulary entry or a vocabulary corresponding to the queried index value in the professional lexicon to the request end.

Example 5

Based on embodiment 1, the embodiment of the invention provides a construction method of a professional lexicon in the safety field of dangerous chemicals, aiming at the problem that the conventional general lexicon cannot support the application of a natural language processing technology in the safety field of dangerous chemicals. The construction method can be used for organization forms of a professional lexicon (which may be referred to as a lexicon for short) and feature vector indexes of the lexicon, as shown in fig. 2, and the construction method can include:

s01) obtaining relevant documents of dangerous chemical safety;

s02) constructing professional word bank entries;

s03) constructing a professional word bank word list;

s04) forming a professional lexicon identification tree;

s05) calculates a professional vocabulary index value.

The organization form of the word stock mainly comprises a word stock word list and word stock entries. The entry is the minimum data unit of the thesaurus, and the composition is shown in table 1 of embodiment 1, and all the parts of speech and corresponding symbols involved are shown in table 2.

TABLE 2 Parts of speech of professional words and corresponding symbols

Part of speech	Symbol	Part of speech	Symbol
Organization name	no	Name of chemical	nc
Place name	np	Name of a person	nn
Other terms	nz	Verb and its usage	v
Adjective	adj,a,ad	Adverb	adv,d
Structural auxiliary	uj	Name of Chinese	nrt
Locative word	f	Conjunction	c
Status word	z	Auxiliary word	u
Numeral	m	Pronoun	r
Noun	n	Place word	s

The word lists divide the lexicon into a plurality of categories representing different entity meanings; each word list is composed of a plurality of entries and is named according to the entity meaning of the entries it contains. The organization of the lexicon, the word lists and the entries is shown in fig. 3.

The lexicon contains 161262 hazardous chemical safety professional vocabularies in total, which are divided into 31 word lists according to their actual meanings. The statistical details of the word lists are shown in table 3.

TABLE 3 thesaurus number statistics statement

The feature vector index of the lexicon is derived from the lexicon identification tree, the structure of which is shown in fig. 4. The root node of the identification tree does not participate in the formation of the feature vectors and can be defined arbitrarily. All word lists in the lexicon serve as the first-layer nodes of the identification tree, the character counts of the vocabularies serve as the second-layer nodes, and all the vocabularies serve as the leaf nodes. From the identification tree, a feature vector matrix W (K × (K + M × N₀ + M)) of the lexicon can be constructed, where K is the number of all vocabularies in the lexicon, M is the number of word lists in the lexicon, and N₀ is the number of characters of the longest vocabulary in the lexicon. The strategy for constructing the feature vector of a vocabulary is as follows: starting from its leaf node, the tree is traversed upwards layer by layer to the root node; the values in the columns corresponding to the leaf node and to the non-leaf nodes passed on the way are all written as 1, and the values of the remaining columns are all written as 0, which forms the feature vector of the vocabulary. The feature vectors of all vocabularies in the lexicon can be obtained in the same way. The obtained feature vectors are then stacked, each occupying one row, to form the feature vector matrix W of the lexicon.

Because the lexicon contains a huge number of vocabularies, the dimension (number of rows and columns) of the feature vectors generated by the above method is high, so a compression and recovery mechanism needs to be established to speed up access. For a feature vector A_W(m₁, m₂, m₃, …), the compressed index value I_W(n₁, n₂, n₃) of the vocabulary can be derived from the following equation (1):

For example, as shown in FIG. 5, for the vocabulary a in the word list M₁, the nodes passed on the path from the root node to a are all written as 1 in the feature vector of a, and the remaining values are 0. The index value of the vocabulary a records, counting from the left, the digit numbers of all non-zero entries of the feature vector; using formula (1), I_a is (1, 2, 5). Conversely, the feature vector of a vocabulary can be quickly restored from its index value.
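The recovery direction can be sketched as follows, assuming the index value lists the 1-based digit numbers of the non-zero entries and that the dimension of the feature vector is known; the dimension of 7 used here is a hypothetical value, and the function name `restore` is an assumption.

```python
def restore(index_value, dimension):
    """Rebuild a 0/1 feature vector from the digit numbers stored in the index value."""
    vector = [0] * dimension
    for position in index_value:      # positions are 1-based digit numbers
        vector[position - 1] = 1
    return vector


print(restore((1, 2, 5), dimension=7))  # [1, 1, 0, 0, 1, 0, 0]
```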

The embodiment of the invention fills the blank of the professional lexicon in the safety field of the dangerous chemicals at present, designs the architecture, the content and the organization mode of the professional lexicon in the safety field of the dangerous chemicals, and a text analysis tool constructed based on the professional lexicon can more accurately identify the vocabulary, the named entities and the relations among the named entities in the safety field of the dangerous chemicals;

the embodiment of the invention realizes the statistics of vocabulary, parts of speech, synonyms and word frequency in the special word bank in the safety field of dangerous chemicals, constructs a word vector calculation and compression mode of the special word bank in the safety field of dangerous chemicals, and is beneficial to the use and storage of the word bank.

Example 6

Based on embodiment 5, this embodiment provides a method for constructing a professional lexicon in the field of hazardous chemical safety. The specific implementation of constructing the professional lexicon is as follows:

For data collection and preprocessing, hazardous chemical safety text data are collected, such as: hazardous chemical accident details, registration information of hazardous chemical enterprises, hazard and operability study reports, safety instrumented system grading and verification study reports, laws and regulations related to hazardous chemical safety, and national standards.

The collected PDF, doc and docx documents are uniformly converted into UTF-8 encoded text documents; documents whose text length is smaller than a threshold are removed, format information is stripped from the documents, and punctuation marks and stop words are removed.

For the construction of the vocabulary entry, the acquisition modes of the professional vocabulary are mainly the following two types:

A. Structured form data are acquired directly from a database system (for example, a chemical e-book database system or a chemical dictionary database) to obtain professional vocabularies related to hazardous chemical safety, for example: burn and scald, poisoning, occupational disease, collision, rubbing and squeezing, high falling, mechanical injury, deflagration, detonation, high falling, leakage, unplanned shutdown, security, liquefied gas, naphtha and the like. The professional vocabularies acquired in this way can directly form entries after the part-of-speech item is added and the word-frequency item is initialized.

B. A word segmentation tool based on a directed acyclic graph and dynamic programming is trained and used to segment the preprocessed document data into segmented vocabularies, from which the professional vocabularies related to hazardous chemical safety are screened out and added to the lexicon. To reduce repeated entry of segmented vocabularies and existing professional vocabularies, the existing professional vocabularies can be added to the word segmentation tool as a user-supplied additional dictionary to assist word segmentation.
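The idea of segmentation over a directed acyclic graph with dynamic programming, and the effect of adding existing professional vocabularies as a user dictionary, can be illustrated with the simplified sketch below. It is not the tool used in the embodiment: the frequency dictionary, the unspaced English sample string and the function names are all hypothetical, and unknown single characters are given an assumed count of 1.

```python
import math


def build_dag(sentence, freq):
    """dag[i] lists every end position j such that sentence[i:j+1] is a dictionary word."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]              # fall back to a single character
    return dag


def segment(sentence, freq):
    """Choose the maximum log-probability path through the DAG, right to left."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    n = len(sentence)
    best = {n: (0.0, 0)}                  # position -> (best score, end of first word)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + best[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


base_freq = {"tower": 5, "bottom": 5, "pump": 5, "leak": 8}
user_freq = dict(base_freq, towerbottompump=3)    # professional term added as user dictionary

print(segment("towerbottompumpleak", base_freq))  # ['tower', 'bottom', 'pump', 'leak']
print(segment("towerbottompumpleak", user_freq))  # ['towerbottompump', 'leak']
```

In this toy example the compound term is kept whole once it is present in the dictionary, because one longer word scores higher than the product of three shorter ones.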

For example, a hazardous chemical enterprise accident report contains the following description: the fire accident of the arene combination device causes the damage of the bearing, the sealing, the inlet and outlet pipelines, the nearby pipelines, the cables, the pipe gallery structure and the like of the bottom pump of the reforming oil separation tower. The direct reason is that the thrust bearing at the non-driving end of the bottom pump of the reforming oil separation tower is damaged, so that the shaft vibrates violently and displaces, and the serious damage of the mechanical seal of the two poles at the non-driving end of the reforming oil separation tower causes leakage.

The result of segmenting the text by using the segmentation tool is as follows: arene _ n/union _ v/device _ n/fire accident _ n/cause _ v/reform _ n/generate _ v/oil _ n/split _ v/tower _ nrt/bottom pump _ n/' -uj/bearing _ n/seal _ v/and _ c/import-export _ n/pipeline _ n/and _ c/nearby _ f/pipeline _ n/cable _ n/and _ c/pipe _ n/corridor _ n/structure _ n/iso _ u/damage _ v/direct _ ad/cause _ n/be _ v/reform _ n/generate _ v/oil _ n/split _ v/tower _ nrt/bottom pump _ n/non _ d/drive _ n/z/stop _ v/push _ v/bearing _ n/damage v/dam of end Cause _ v/shaft _ n/fierce _ a/vibration _ n/and shaft _ nz/displacement _ v/cause _ v/leak _ v _ uj/severe _ a/damage _ v/cause _ v/for _ v/the _ r/pump _ n/non _ d/drive _ n/end _ z/two poles _ m/machine _ n/seal _ s/. The vocabulary which can not be correctly recognized by the existing word segmentation tool, such as an aromatic hydrocarbon combination device, a reformed oil separation tower, a tower bottom pump, a pipe gallery structure and the like, is extracted to form a vocabulary entry.

For constructing the word lists, the collected entries are classified according to their entity meanings in the field of hazardous chemical safety. The word lists contained in the professional lexicon are: fixed device names, organization department rooms, chemical enterprise plates, personnel public jobs, employee names, Chinese administrative divisions, organization organizations, accident types, refining device categories, ten major risks, physical media, enterprise departments, income creating means, honor names, accident briefing words, platform risk areas, law and regulation words, service company names, dangerous chemical catalogs, refining device risk areas, risk point names, gathering and transportation device risk areas, processing unit names, pipeline risk areas, gas station risk areas, tank area risk areas, oil field block risk areas, warehouse risk areas, oil depot risk areas, well risk areas and accident death reasons.

The formed word lists are numbered, and the professional vocabularies in each word list are then ordered by number of characters to form the lexicon identification tree shown in fig. 6.

For creating word vectors and indexes, feature vectors of each professional vocabulary can be generated according to the created word stock identification tree, as shown in fig. 7. According to the feature vector matrix and the calculation formula, the index values of the vocabulary can be calculated as follows: explosion (6, 8, 9), fire (6, 8, 10), leakage (6, 8, 11), falling object (6, 13, 14), traffic accident (6, 13, 15), natural disaster (6, 13, 16), unplanned shutdown (6, 17, 18).
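One column layout that reproduces the listed index values is sketched below. It is an inference, not something the source confirms: it assumes the preceding word-list groups occupy 5 columns, that each word list reserves one column for the list itself and one for every character count from 1 up to the longest vocabulary, and that the character counts given (2, 4 and 5) match the original-language terms. The function name `index_values` is hypothetical.

```python
def index_values(words_with_length, preceding_columns):
    """Assign 1-based column numbers: the word-list node first, then for each
    character count a length node followed by the vocabularies of that length."""
    column = preceding_columns + 1
    list_column = column
    result = {}
    max_length = max(length for _, length in words_with_length)
    for length in range(1, max_length + 1):
        column += 1                        # second-layer node for this character count
        length_column = column
        for word, word_length in words_with_length:
            if word_length == length:
                column += 1                # leaf node for this vocabulary
                result[word] = (list_column, length_column, column)
    return result


# (term, assumed character count of the original-language term)
accident_briefing = [
    ("explosion", 2), ("fire", 2), ("leakage", 2),
    ("falling object", 4), ("traffic accident", 4), ("natural disaster", 4),
    ("unplanned shutdown", 5),
]
values = index_values(accident_briefing, preceding_columns=5)
print(values["explosion"])           # (6, 8, 9)
print(values["natural disaster"])    # (6, 13, 16)
print(values["unplanned shutdown"])  # (6, 17, 18)
```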

Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solutions of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications all belong to the protection scope of the embodiments of the present invention.

It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention do not describe every possible combination.

Those skilled in the art will understand that all or part of the steps in the method according to the above embodiments may be implemented by a program, which is stored in a storage medium and includes several instructions to enable a single chip, a chip, or a processor (processor) to execute all or part of the steps in the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In addition, any combination of various different implementation manners of the embodiments of the present invention is also possible, and the embodiments of the present invention should be considered as disclosed in the embodiments of the present invention as long as the combination does not depart from the spirit of the embodiments of the present invention.

Claims

1. A construction method of a professional lexicon is characterized by comprising the following steps:

S2) forming a word list of the professional lexicon, then constructing an identification tree about the word list and the vocabulary, obtaining an index value corresponding to a character string of the vocabulary through the identification tree, and recording a mapping relation between the index value and the character string or the entry of the vocabulary in the professional lexicon, wherein the word list is defined as a set of entries of the same type in the professional lexicon.

2. The method for constructing a professional thesaurus according to claim 1, wherein the forming of the entry storage structure in step S1) comprises:

3. The method for constructing a thesaurus of specialties according to claim 2, wherein the step S1) of obtaining words related to safety specialties of dangerous chemicals comprises:

4. The method for constructing a professional thesaurus according to claim 3, wherein in step S1), before using a tokenizer to tokenize the document data about the dangerous chemical safety professional, the method further comprises:

converting the text data into a text document in the same data format;

5. The method as claimed in claim 3, wherein the step S1) of forming the vocabulary into entries of the professional lexicon according to the entry storage structure comprises:

6. The method as claimed in claim 2, wherein the step S1) of forming the vocabulary into entries of the professional lexicon according to the entry storage structure comprises:

7. The method for constructing a professional thesaurus according to claim 1, wherein the word list in step S2) comprises:

8. The method for constructing a professional thesaurus according to claim 1, wherein the step S2) of constructing an identification tree about the word list and the vocabulary comprises:

9. The method for constructing a professional thesaurus according to claim 8, wherein the step S2) further comprises, after constructing the identification tree about the word list and the vocabulary, and before obtaining the index value corresponding to the character string of the vocabulary through the identification tree:

and inquiring identifiers of nodes on a path from the identifiers of the current leaf nodes to the identifiers of the root nodes, changing the identification values of the columns in the characteristic vectors corresponding to the identifiers of the nodes on the path into first characteristic numerical values, and then changing the identification values of the columns in the characteristic vectors which are not changed into second characteristic numerical values to obtain changed characteristic vectors.

10. The method for constructing a professional thesaurus as claimed in claim 9, wherein the step S2) of obtaining the index value corresponding to the character string of the vocabulary through the identification tree comprises:

11. The method for constructing a professional lexicon as claimed in claim 10, wherein step S2) comprises performing a compression calculation on the modified feature vectors corresponding to the character strings of the vocabulary, wherein the compression calculation comprises:

calculating the numerical sum F(n_i):

12. A search method of a professional lexicon is characterized by comprising the following steps:

acquiring character strings of a recognition text, and inquiring index values matched with the character strings of the recognition text in the professional lexicon of any one of claims 1 to 11 through the mapping relation of any one of claims 1 to 11;

13. A system for professional vocabulary indexing, the system comprising:

14. An apparatus for professional vocabulary indexing, comprising:

at least one processor;

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method of any one of claims 1 to 11 by executing the instructions stored by the memory.

15. A computer readable storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 11.