Background
The STKOS vocabulary is a computer-application-oriented super-vocabulary for science and technology. It draws on advanced international knowledge organization techniques and methods and on the experience gained in building existing knowledge organization systems at home and abroad. The STKOS vocabulary helps in better developing and utilizing resources such as scientific and technical literature and patents, and is of considerable significance for areas such as the advancement of the national information industry and literature sharing.
However, the expansion of the STKOS vocabulary has not kept pace with technological development: new words are not added in a timely manner, and the updating method is time-consuming and labor-intensive. How to expand the STKOS vocabulary efficiently and intelligently has therefore become a technical problem in urgent need of a solution in the field.
Summary of the Application
In view of the above-mentioned shortcomings of the prior art, the present application aims to provide a scientific vocabulary expansion method, apparatus, terminal and medium based on grammar patterns, which are used to solve the technical problem that the STKOS vocabulary in the prior art cannot be efficiently and intelligently expanded.
To achieve the above and other related objects, a first aspect of the present application provides a scientific vocabulary expansion method based on grammar patterns, which includes: extracting a plurality of entity relationships from one or more texts based on a grammar pattern; determining one or more entity relationships associated with each of the search contents among the extracted plurality of entity relationships using one or more words in the original scientific and technological vocabulary before expansion as the search contents; and expanding the original scientific and technological word list based on the entity relation related to each search content to form a new scientific and technological word list with a larger vocabulary hierarchy compared with the original scientific and technological word list.
In some embodiments of the first aspect of the present application, the method further comprises: determining, according to the frequency with which entity relationships occur in a large-scale corpus, the erroneously extracted entity relationships in the new scientific and technological vocabulary so that they can be corrected.
In some embodiments of the first aspect of the present application, the method further comprises: for any two lower entities among the plurality of lower entities corresponding to an upper entity, each of the two lower entities forms an entity relationship with the upper entity, giving a first entity relationship and a second entity relationship; if the frequencies with which the first and second entity relationships occur in the large-scale corpus are higher than the frequency with which a third entity relationship, formed between the two lower entities, occurs in the large-scale corpus, the third entity relationship is determined to be an erroneously extracted entity relationship.
In some embodiments of the first aspect of the present application, the method comprises: if the entity in an entity relationship can be divided into two or more independent words, judging whether the frequency with which the two or more independent words appear together is greater than a preset threshold; and if the frequency is greater than the preset threshold, determining that the two or more independent words belong to the same entity.
In some embodiments of the first aspect of the present application, the method comprises: if the entity in an entity relationship can be divided into two or more independent words, judging whether the frequency with which an independent word appears on its own is greater than the frequency with which the independent words appear together; if so, determining that the independent word does not belong to the entity.
In some embodiments of the first aspect of the present application, the scientific vocabulary comprises a STKOS vocabulary.
In some embodiments of the first aspect of the present application, the entities in each entity relationship stand in a membership (isA) relationship.
To achieve the above and other related objects, a second aspect of the present application provides a scientific vocabulary expansion apparatus based on grammar patterns, comprising: the entity relation extracting module is used for extracting a plurality of entity relations from one or more texts based on the grammar mode; the word list expansion module is used for determining one or more entity relations related to each search content in the extracted entity relations by taking one or more vocabularies in an original scientific and technological word list before expansion as the search content, and expanding the original scientific and technological word list based on the entity relations related to each search content to form a new scientific and technological word list with a larger vocabulary hierarchy compared with the original scientific and technological word list.
To achieve the above and other related objects, a third aspect of the present application provides a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the scientific vocabulary expansion method based on grammar patterns.
To achieve the above and other related objects, a fourth aspect of the present application provides an electronic terminal comprising: a processor and a memory; the memory is used for storing computer programs, and the processor is used for executing the computer programs stored in the memory so as to enable the terminal to execute the scientific and technological word list expansion method based on the grammar mode.
As described above, the scientific and technological vocabulary expansion method, apparatus, terminal, and medium based on grammar patterns of the present application have the following beneficial effects: the technical solution provides an automatic, grammar-pattern-based vocabulary expansion scheme that can expand the STKOS vocabulary efficiently and intelligently and keep pace with scientific and technological development, thereby effectively solving the problems in the prior art.
Detailed Description
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It is noted that in the following description, reference is made to the accompanying drawings, which illustrate several embodiments of the present application. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present application. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present application is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. Spatially relative terms, such as "upper," "lower," "left," "right," "below," and "above," may be used herein to facilitate describing one element or feature's relationship to another element or feature as illustrated in the figures.
In this application, unless expressly stated or limited otherwise, the terms "mounted," "connected," "secured," "retained," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," and/or "including," when used in this specification, specify the presence of stated features, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition will occur only when a combination of elements, functions or operations is inherently mutually exclusive in some way.
Aiming at the technical problems that the expansion of the STKOS vocabulary in the prior art does not follow the pace of scientific and technological development, the update of new words is not timely, the update method is time-consuming, the manpower input is high and the like, the invention provides a scientific and technological vocabulary expansion method, a device, a terminal and a medium based on a grammar mode, and the problems in the prior art are effectively solved. The technical scheme of the invention aims to provide a vocabulary automatic expansion scheme based on a grammar mode, so that efficient and intelligent vocabulary expansion is carried out on the STKOS vocabulary, and the pace of scientific and technological development is followed.
Fig. 1 is a flow chart of a scientific and technological vocabulary expansion method based on a grammar pattern according to an embodiment of the present application.
It should be noted that the scientific and technological vocabulary expansion method based on grammar patterns in the present application can be applied to various types of hardware devices. Specifically, the hardware device may be a controller, such as an ARM (Advanced RISC Machines) controller, an FPGA (Field Programmable Gate Array) controller, an SoC (System on Chip) controller, a DSP (Digital Signal Processing) controller, or an MCU (Micro Controller Unit) controller; the hardware device may also be a computer device including components such as memory, memory controllers, one or more processing units (CPUs), peripheral interfaces, RF circuits, audio circuits, speakers, microphones, input/output (I/O) subsystems, display screens, other output or control devices, and external ports; the computer device includes, but is not limited to, a personal computer such as a desktop computer, a notebook computer, a tablet computer, a smart phone, a smart television, a Personal Digital Assistant (PDA), and the like; the hardware device may also be a server, and the server may be deployed on one or more physical servers according to factors such as function and load, or may be formed by a distributed or centralized server cluster, which is not limited in this embodiment.
In this embodiment, the scientific vocabulary expansion method based on the grammar pattern includes step S101, step S102, and step S103.
In step S101, a plurality of entity relationships are extracted from one or more texts based on a grammar schema.
In some alternative implementations, the grammar pattern is based on syntactic analysis, which discovers relationships between words by analyzing the syntactic structure of a sentence. Grammar patterns include, but are not limited to, Hearst patterns. Hearst patterns are classed as grammar patterns because they specify grammar-related tags; the Hearst approach mines context by recognizing grammar patterns in sentences.
In some alternative implementations, the entity relationships include isA entity relationships, i.e., there is a membership relationship between the entities in an entity relationship. Preferably, the entity relationships are extracted from large-scale text derived from an English literature database and/or scientific literature content crawled by a web crawler. Specifically, the isA entity relationships can be automatically extracted, based on Hearst patterns, from the textual content (titles, abstracts, full text, etc.) of large-scale document collections.
To facilitate understanding by those skilled in the art, the Hearst patterns mentioned in the present embodiment will now be explained in conjunction with Table 1 below. The left column of Table 1 gives the pattern type; the right column gives an example corresponding to the pattern type; NP denotes a noun phrase.
TABLE 1
The isA entity relationships are extracted in this embodiment by matching the Hearst patterns against the text. Taking the first pattern type in Table 1 as an example: the feature phrase "such as" is located, and the noun phrases before and after it are taken as the upper term and the lower terms, respectively. For example, "NP such as {NP,}* {(or | and)} NP" can be written as "NP0 such as {NP1, NP2, …, (or | and)} NPn", which means that for any NPi, "NPi isA NP0" holds.
Therefore, entity relationships can be extracted easily from words and sentences that match a pattern. For example, the entity relationship "Linear Regression isA Learning Algorithms" is extracted from "Learning Algorithms like Linear Regression, processing Tree", and the entity relationship "processing Tree isA Learning Algorithms" is extracted from "Learning Algorithms such as processing Tree". Many similar examples exist and are not described here.
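The "such as" pattern described above can be illustrated with a minimal Python sketch. It is not the implementation of this application: the function name `extract_isa` and both regular expressions are hypothetical, and noun phrases are approximated as plain runs of words, whereas a practical system would identify noun phrases with a part-of-speech tagger or chunker.

```python
import re

# Feature phrase for the first Hearst pattern; noun phrases are approximated
# as runs of word characters and spaces (an assumption of this sketch).
SUCH_AS = re.compile(r"(?P<hyper>[\w ]+?)\s+such as\s+(?P<hypos>.+)", re.IGNORECASE)
# Separators between lower terms: a comma, optionally followed by "and"/"or",
# or a bare "and"/"or" standing between two noun phrases.
SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_isa(sentence):
    """Return (lower term, upper term) pairs for 'NP0 such as NP1, ..., (and|or) NPn'."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return []
    upper = match.group("hyper").strip()
    lowers = SEPARATOR.split(match.group("hypos").rstrip(" ."))
    return [(lower.strip(), upper) for lower in lowers if lower.strip()]


if __name__ == "__main__":
    # Illustrative sentence, not taken from any corpus.
    for lower, upper in extract_isa("Learning Algorithms such as Linear Regression, SVM and RF"):
        print(f"{lower} isA {upper}")
```

Because everything before "such as" is treated as the upper term, this sketch only behaves sensibly on short phrases; the other pattern types would each need their own feature phrase.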
In step S102, one or more vocabularies in the original scientific and technological vocabulary before expansion are used as search contents, and one or more entity relationships associated with each of the search contents are determined among the extracted entity relationships.
Preferably, the words of the STKOS scientific and technological vocabulary before expansion, including the words at every level, are used as search terms, and one or more isA entity relationships associated with each search term are found from the large-scale literature.
In step S103, the original scientific and technological vocabulary is expanded based on the entity relationship associated with each of the search contents to form a new scientific and technological vocabulary having a larger vocabulary hierarchy than the original scientific and technological vocabulary.
Specifically, the original scientific and technological vocabulary is expanded according to the one or more isA entity relationships that are found in the large-scale literature and associated with the search terms, forming a new scientific and technological vocabulary with a larger vocabulary hierarchy than the original. This realizes automatic expansion of the vocabulary: updates are timely, take little time, and require little manual effort, which is a clear advantage over the prior art.
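Steps S102 and S103 can be sketched as follows. This is a simplified illustration under stated assumptions, not the method as claimed: the vocabulary is modelled as a dictionary mapping each term to its set of narrower terms, a relationship counts as "associated" with a search term only when the term is its upper entity, matching is case-insensitive, and the function name `expand_vocabulary` is hypothetical.

```python
from collections import defaultdict


def expand_vocabulary(vocabulary, relations):
    """Expand a vocabulary with extracted isA relationships.

    vocabulary: dict mapping each term to a set of narrower terms.
    relations:  iterable of (lower term, upper term) pairs.
    Returns a new dict; the input vocabulary is left unchanged.
    """
    # Index the extracted relationships by upper term for fast lookup.
    by_upper = defaultdict(set)
    for lower, upper in relations:
        by_upper[upper.lower()].add(lower)

    expanded = {term: set(children) for term, children in vocabulary.items()}
    for term in vocabulary:  # every original term serves as a search term
        for lower in by_upper.get(term.lower(), ()):
            expanded[term].add(lower)         # attach the new narrower term
            expanded.setdefault(lower, set())  # and register it as a term
    return expanded


if __name__ == "__main__":
    original = {"learning algorithms": set()}
    extracted = [("Linear Regression", "Learning Algorithms"), ("cats", "animals")]
    print(expand_vocabulary(original, extracted))
```

The sketch performs a single pass, so only terms already present in the original vocabulary gain narrower terms; repeating the pass over the result would deepen the hierarchy further.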
Fig. 2 is a flow chart of a scientific vocabulary expansion method based on a grammar pattern according to an embodiment of the present application. In this embodiment, the scientific and technological vocabulary expansion method includes steps S201 to S204.
In step S201, a plurality of entity relationships are extracted from one or more texts based on a grammar schema.
In step S202, one or more vocabularies in the original scientific and technological vocabulary before expansion are used as search contents, and one or more entity relationships associated with each of the search contents are determined among the extracted entity relationships.
In step S203, the original scientific and technological vocabulary is expanded based on the entity relationship associated with each of the search contents to form a new scientific and technological vocabulary having a larger vocabulary hierarchy than the original scientific and technological vocabulary.
It should be noted that, the method flow steps S201 to S203 in this embodiment are similar to the method flow steps S101 to S103 in the above embodiment, and therefore, the detailed description thereof is omitted.
In step S204, the erroneously extracted entity relationships in the new scientific and technological vocabulary are identified according to the frequency with which entity relationships occur in the large-scale corpus, so that they can be corrected.
It should be noted that the correction includes, but is not limited to, the following operations: deleting the erroneously extracted entity relationship, modifying it, or overwriting it, which is not limited in this embodiment.
In some cases, the erroneous entity relationships are caused by, for example, noise words. For example, the erroneous entity relationship "cats isA dogs" is extracted from a sentence containing "animals other than dogs" owing to the interference of the noise phrase "other than".
For the extraction error caused by the interference of the noise vocabulary, the method steps shown in fig. 3 can be executed to determine whether the entity relationship is the extraction error, as shown in steps S301 to S303 below:
in step S301, for any two lower entities among the plurality of lower entities corresponding to an upper entity, each of the two lower entities forms an entity relationship with the upper entity, giving a first entity relationship and a second entity relationship, respectively.
In step S302, it is determined whether the frequencies with which the first entity relationship and the second entity relationship occur in the large-scale corpus are higher than the frequency with which a third entity relationship, formed between the two lower entities, occurs in the large-scale corpus.
In step S303, if so, the third entity relationship is determined to be an erroneously extracted entity relationship.
For example, let the first entity relationship t1 be "cats isA animals", the second entity relationship t2 be "dogs isA animals", and the third entity relationship t3 be "cats isA dogs". If t1 and t2 are extracted from the large-scale corpus more frequently than t3, then, combined with the rule that two lower entities of the same upper entity cannot themselves stand in an upper-lower relationship, t3 can be determined to be an erroneously extracted entity relationship.
It should be noted that the frequency with which an entity relationship occurs in the large-scale corpus bears directly on its reliability: the higher the frequency, the more reliable the relationship, and vice versa. When judging whether an entity relationship was extracted in error, it is therefore preferable to determine whether the frequencies of the first and second entity relationships in the large-scale corpus are significantly higher than the frequency of the third entity relationship; if so, the third entity relationship is determined to be erroneous. Judging in this way improves accuracy.
In addition, whether the frequency of the first entity relationship (or the second entity relationship) in the large-scale corpus is "significantly higher" than that of the third entity relationship can be decided in this embodiment by calculating whether the ratio of the former frequency to the latter is greater than a preset threshold.
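The check of steps S301 to S303, together with the ratio test for "significantly higher", might look like the sketch below. The function name `find_noise_errors`, the default threshold of 5.0, and the frequency table are all hypothetical. The sketch also assumes that both the first and the second relationship must pass the ratio test, whereas the text allows either to be compared.

```python
def find_noise_errors(freq, ratio_threshold=5.0):
    """Return relationships judged to be extraction errors.

    freq: dict mapping (lower entity, upper entity) to its corpus frequency.
    A relationship t3 = (a, b) between two lower entities of the same upper
    entity is flagged when both t1 = (a, upper) and t2 = (b, upper) occur
    more than ratio_threshold times as often as t3.
    """
    # Group lower entities under their upper entity.
    lowers_of = {}
    for lower, upper in freq:
        lowers_of.setdefault(upper, set()).add(lower)

    errors = set()
    for upper, lowers in lowers_of.items():
        for a in lowers:
            for b in lowers:
                t3 = (a, b)
                if a == b or t3 not in freq:
                    continue
                t1, t2 = (a, upper), (b, upper)
                # Ratio test written as a product to avoid dividing by zero.
                if min(freq[t1], freq[t2]) > ratio_threshold * freq[t3]:
                    errors.add(t3)
    return errors


if __name__ == "__main__":
    # Illustrative counts only, not measured on any corpus.
    counts = {("cats", "animals"): 120, ("dogs", "animals"): 150, ("cats", "dogs"): 2}
    print(find_noise_errors(counts))
```

Raising the threshold makes the check more conservative, flagging a sibling relationship only when the evidence against it is stronger.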
In other cases, the erroneously extracted entity relationships are caused by, for example, word segmentation errors. For example, in "algorithms including SVM, LR and RF", the word segmentation model has difficulty determining whether "LR and RF" is a single entity or two entities, resulting in an extraction error.
For extraction errors caused by interference of word segmentation errors, whether the entity relationship is an extraction error can be determined by executing the method steps shown in fig. 4, specifically as shown in steps S401 to S402 below:
in step S401, if an entity in an entity relationship can be divided into two or more independent words, it is determined whether the frequency with which the two or more independent words appear together is greater than a preset threshold.
In step S402, if the frequency is greater than the preset threshold, it is determined that the two or more independent words belong to the same entity.
Taking the expression "LR and RF" as an example, it is to be determined whether "LR and RF" is one entity or two entities, where "LR and RF" can be divided into two independent words (i.e., "LR" and "RF"). If the frequency with which "LR and RF" occurs in the large-scale corpus is greater than the preset threshold, "LR and RF" is determined to be a whole, i.e., "LR" and "RF" belong to the same entity.
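Steps S401 and S402 amount to counting how often the undivided expression occurs and comparing the count with the threshold. A minimal sketch follows, in which the corpus is assumed to be a list of sentences held in memory and the names `count_phrase` and `is_single_entity` are hypothetical; the corpus and threshold in the usage example are purely illustrative.

```python
import re


def count_phrase(corpus, phrase):
    """Count whole-word, case-sensitive occurrences of phrase in the corpus."""
    pattern = re.compile(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)")
    return sum(len(pattern.findall(sentence)) for sentence in corpus)


def is_single_entity(corpus, phrase, threshold):
    """Treat the phrase as one entity if it occurs more than threshold times."""
    return count_phrase(corpus, phrase) > threshold


if __name__ == "__main__":
    corpus = [
        "research and development spending rose",
        "research and development is costly",
        "LR and RF were compared",
    ]
    print(is_single_entity(corpus, "research and development", threshold=1))
    print(is_single_entity(corpus, "LR and RF", threshold=1))
```

An absolute threshold depends on corpus size, so in practice it would need to be tuned for the corpus at hand.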
For extraction errors caused by interference of word segmentation errors, the method steps shown in fig. 5 may also be performed to determine whether the entity relationship is an extraction error, specifically as follows:
in step S501, if an entity in an entity relationship can be divided into two or more independent words, it is determined whether the frequency with which an independent word appears on its own is greater than the frequency with which the independent words appear together.
In step S502, if yes, it is determined that the independent word does not belong to the entity.
Still taking the expression "algorithms including SVM, LR and RF" as an example, it is necessary to determine whether "LR and RF" is one entity or two entities, where "LR and RF" can be divided into two independent words (i.e., "LR" and "RF"). If, in the large-scale corpus, the independent word "LR" occurs more frequently than "LR and RF" occurs together, or the independent word "RF" occurs more frequently than "LR and RF" occurs together, it may be determined that "LR and RF" is not a whole, i.e., "LR" and "RF" are separate entities.
It should be noted that the frequency of occurrence in the large-scale corpus bears directly on reliability: the higher the frequency, the more reliable the result, and vice versa. When judging whether an entity relationship was extracted in error, it is therefore preferable to determine whether the frequency with which an independent word appears on its own is significantly higher than the frequency with which the independent words appear together; if so, the independent word is determined not to be a component of the entity. Judging in this way improves accuracy.
In addition, whether the frequency with which an independent word appears on its own in the large-scale corpus is "significantly higher" than the frequency with which the independent words appear together can be decided in this embodiment by calculating whether the ratio of the former frequency to the latter is greater than a preset threshold.
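The ratio test of steps S501 and S502 can be sketched in the same spirit. The function name `should_split`, the default threshold of 3.0, and the counts are hypothetical; frequencies are assumed to have been counted beforehand and passed in as a dictionary.

```python
def should_split(counts, parts, joined, ratio_threshold=3.0):
    """Decide whether a joined expression should be split into its parts.

    counts: dict mapping a word or expression to its corpus frequency.
    parts:  the independent words, e.g. ["LR", "RF"].
    joined: the undivided expression, e.g. "LR and RF".
    Returns True if any part occurs more than ratio_threshold times as
    often as the joined expression.
    """
    together = counts.get(joined, 0)
    for part in parts:
        alone = counts.get(part, 0)
        # Ratio test written as a product so a zero count is handled safely.
        if alone > ratio_threshold * together:
            return True
    return False


if __name__ == "__main__":
    # Illustrative counts only, not measured on any corpus.
    counts = {"LR": 400, "RF": 350, "LR and RF": 12}
    print(should_split(counts, ["LR", "RF"], "LR and RF"))
```

Setting `ratio_threshold` to 1.0 reduces the test to the plain "greater than" comparison of step S501.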
In an embodiment, the present application further provides a computer storage medium having a computer program stored thereon, which when executed by a processor, performs the above-mentioned method steps.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 6 is a schematic structural diagram of a scientific and technological vocabulary expansion apparatus based on a grammar pattern according to an embodiment of the present application. The scientific and technological vocabulary expansion device comprises an entity relation extraction module 61 and a vocabulary expansion module 62.
The entity relationship extracting module 61 is configured to extract a plurality of entity relationships from one or more texts based on a grammar pattern; the vocabulary expansion module 62 is configured to determine one or more entity relationships associated with each of the search contents from among the extracted plurality of entity relationships, using one or more vocabularies in an original scientific vocabulary before expansion as search contents, and expand the original scientific vocabulary based on the entity relationships associated with each of the search contents to form a new scientific vocabulary having a larger vocabulary hierarchy than the original scientific vocabulary.
The implementation of the scientific and technological vocabulary expansion apparatus based on the grammar pattern provided in this embodiment is similar to the implementation of the scientific and technological vocabulary expansion method based on the grammar pattern, and is not repeated here.
It should be further understood that the division of the modules of the above apparatus is only a division of logical functions, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated.
Fig. 7 is a schematic structural diagram of an electronic terminal according to an embodiment of the present application. This embodiment provides an electronic terminal comprising: a processor 71 and a memory 72; the memory 72 is connected to the processor 71 through a system bus, over which the two communicate; the memory 72 is used for storing a computer program, and the processor 71 is used for running the computer program so that the electronic terminal executes the steps of the scientific and technological vocabulary expansion method based on grammar patterns.
The above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface is used for realizing communication between the database access device and other equipment (such as a client, a read-write library and a read-only library). The Memory may include a Random Access Memory (RAM), and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
In summary, the present application provides a scientific and technological vocabulary expansion method, apparatus, terminal, and medium based on a grammar mode, and the technical scheme of the present application aims to provide an automatic vocabulary expansion scheme based on a grammar mode, which can perform efficient and intelligent vocabulary expansion on an STKOS vocabulary, following the pace of scientific and technological development, thereby effectively solving the problems in the prior art. Therefore, the application effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical concepts disclosed in the present application shall be covered by the claims of the present application.