CN115935943A - Analysis framework supporting natural language structure calculation - Google Patents
- Publication number
- CN115935943A CN115935943A CN202211333124.2A CN202211333124A CN115935943A CN 115935943 A CN115935943 A CN 115935943A CN 202211333124 A CN202211333124 A CN 202211333124A CN 115935943 A CN115935943 A CN 115935943A
- Authority
- CN
- China
- Prior art keywords
- grid
- language
- attribute
- finite state
- units
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an analysis framework supporting natural language structure calculation. The framework comprises three functional modules, namely a grid, a data table and a finite state automaton, and designs a series of APIs (application programming interfaces) around them which, together with the Lua scripting language, are used to write language structure calculation scripts, including a master control script and finite state automaton scripts; the language structure calculation scripts are then run by an executable program, thereby realizing language structure calculation. The framework is dominated by symbolic calculation, accepting symbolic input and producing symbolic output, so it has good controllability, extensibility and interpretability; during calculation it can schedule the various parameterized calculation models, fully exploiting the capability of parameter models; finally, the framework can introduce explicit knowledge, under whose guidance deep and fine-grained language structure calculation becomes possible.
Description
Technical Field
The invention relates to a natural language processing method, in particular to an analysis framework supporting natural language structure calculation.
Background
Language structure calculation is the process of structuring natural language, i.e. parsing natural language into language structures, and is the essence of natural language understanding. A language structure is the manifestation of the regularity of natural language in form, content and usage: in form, this regularity appears as syntactic structure; in content, as semantic structure; in usage, as pragmatic structure.
The prior art has two strategies for performing language structure calculation: the end-to-end strategy and the cascading strategy. The end-to-end strategy takes raw data as input and directly outputs the final result; the mainstream approach is deep learning, which is typically data-driven: an optimization function is set for the target, language units are represented by parameterized vectors, and the model is built through parameter learning.
The end-to-end strategy has been the dominant approach in recent years and has greatly improved the results of almost all natural language processing tasks, whether core NLP tasks or downstream applications. But the strategy has also run into the following bottlenecks:
1) Lack of the ability to use explicit knowledge. The knowledge in a deep learning model comes entirely from data, e.g. the domain, grammatical and semantic knowledge in labeled data and the world knowledge in large-scale pre-trained language models. External, non-data knowledge, such as expert knowledge and offline knowledge, cannot be embedded into the model in a suitable way, so the model cannot solve complex problems under the guidance of knowledge.
2) Lack of the ability to reason and analyze. A deep learning model is a black-box parameter system: internally it is not a human-like concept system, externally it cannot be linked to world knowledge, and it cannot perform reasoning or analysis.
3) Lack of interpretability and controllability. Because end-to-end calculation is a black box, the internal structure of a deep learning model is fully parameterized, so its interpretability and controllability are poor.
The cascading strategy decomposes the problem into two or more successive subtasks; each subtask is completed independently, and the input of a later subtask is the output of the preceding one. For example, a Chinese syntactic analysis task under the cascading strategy is often divided into two stages: Chinese word segmentation and syntactic structure analysis, the latter taking the output of the segmentation task as its input.
The main problem of the cascading strategy is error propagation: because the probabilities of the cascaded models multiply, the precision and recall of the whole model are unsatisfactory.
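The "probability multiplication" effect can be made concrete with a small numeric sketch (the stage accuracies below are hypothetical figures for illustration, not from the patent):

```python
# Illustrative only: stage accuracies are hypothetical.
def cascade_accuracy(stage_accuracies):
    """Upper bound on end-to-end accuracy of a cascade, assuming each
    stage consumes the previous stage's output and errors compound
    multiplicatively."""
    acc = 1.0
    for a in stage_accuracies:
        acc *= a
    return acc

# Two-stage Chinese parsing pipeline: word segmentation, then syntax analysis.
print(round(cascade_accuracy([0.95, 0.90]), 3))  # 0.855
```

Even with two strong stages, the whole pipeline falls below either stage alone, which is the bottleneck the invention targets.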
Disclosure of Invention
The invention provides an analysis framework supporting natural language structure calculation, which is used to solve the problem of language structure calculation. The technical scheme is as follows:
an analysis framework supporting natural language structure computation, comprising:
grid: a data structure for storing language structures, serving as a computing platform that carries various language structures;
data table: a packaging component used to encapsulate symbolic unary knowledge and binary knowledge;
a finite state automaton: a calculation control component used to characterize language contexts, which cooperates with the scripting language to complete the control tasks of the code's functional structure;
an API system, designed around the grids, data tables and finite state automata, used together with the Lua scripting language to write the language structure calculation scripts;
an executable program for the language structure calculation scripts: it provides an indexing function, which indexes the data tables and finite state automata; an execution function, which runs the language structure calculation scripts to realize language structure calculation; and an execution function for the master control script, whereby the index file paths of the data tables and finite state automata, together with the IP addresses and interfaces of external services, are written into a configuration file that is passed to the executable program.
The grid includes the following internal variables: the text to be analyzed, the grid attributes, the grid cells and the relationships among grid cells;
the grid cell includes the following internal variables: the stored language units, the attributes of the grid units, and the features and scores of the attributes of the grid units;
the relationships between the grid cells include the following internal variables: attributes of relationships between grid cells, characteristics and scores of relationships between grid cells.
The language structure is expressed as a directed graph with attribute information, and comprises language units, relations and attributes, and the loading mode of the language structure loaded into the grid is as follows: the language units are borne by the grid units, and each language unit corresponds to one grid unit; relationships between language units are carried by relationships between grid units; the attributes of the language units are carried by the attributes of the grid units; the attributes of the relationships between language units are carried by the attributes of the relationships between grid units.
The language structure is denoted G = (U, R), wherein:
1) U is the set of nodes of the graph, here a finite set of language units, each noted: u = {tu, au}, where tu ∈ Lex, Lex is the set of language unit strings, and au is the attribute of the language unit, a set of key-value pairs noted au = {K = V}, K being an attribute name and V an attribute value;
2) R is a finite set of edges connecting two different nodes in U, i.e. a set of relationships between two language units; since relationships between language units are generally not symmetric, each edge is directed and is noted: (u_i, u_j, r, ar), where u_i ∈ U, u_j ∈ U, r ∈ RT, and RT is the set of relationship types; ar is the attribute of the language unit relationship, a set of key-value pairs noted ar = {K = V}, K being an attribute name and V an attribute value.
The API includes: (1) APIs that call external services and import the service results into the grid, thereby realizing the scheduling of parameterized calculation models; (2) APIs that control the internal variables of the grid structure, including adding, testing and obtaining internal variables, thereby realizing the controllability of the language structure; (3) APIs that apply the data tables, i.e. realize the interaction between data tables and the grid, thereby introducing explicit knowledge; (4) APIs that apply the finite state automata, i.e. realize the interaction between finite state automata and the grid, thereby extending the functions of the framework.
The data table is defined as a set of triples, namely: tableName = { < Item, attribute, condition > }, wherein:
TableName: a table name;
Item: the data item, either a string (Word) or a key-value expression (KV); when it is a Word, it corresponds to a language unit and, during grid computing, to a grid cell; when it is a KV, it identifies all grid cells in the grid for which KV tests true;
Attribute: the attributes of the Item, a set of key-value pairs {K = V}, which are brought into the attributes of a grid cell or into the attributes of a relationship between cells;
Condition: the conditions limiting the application of the Item, a set of key-value expressions {KV}; when the grid cell corresponding to the current Item satisfies one of {KV}, the next operation proceeds, such as adding grid cells, adding attributes or establishing relationships.
The data tables are divided into two types, one type is a description type data table, and the other type is a relational type data table;
the objects described by a descriptive data table are individual, independent language units; the table gives the form of each language unit and knowledge of its attributes, and this knowledge is used to determine language units and to set their attributes;
in a relational data table, each written entry involves two language units: a central (head) language unit and a language unit standing in some relationship to it. Binary relationships are packaged with several tables: a relational data table is designed as one main table plus several sub tables, the main table storing the list of central language units and each sub table storing the list of language units that form a given relationship with a central language unit.
The finite state automaton is defined as a set of four-tuples, namely: FSAName = { < Enter, Path, Operation, Exit > }, wherein: FSAName: the globally unique name of the finite state automaton; Enter: the entry node, each finite state automaton having a unique entry; Path: the path corresponding to the finite state automaton, representing the context information during structure calculation; Operation: the operation node, i.e. the action executed when the context corresponding to the Path is matched successfully; Exit: the exit node, each finite state automaton having a unique exit.
One script can correspond to one or more finite state automata, and the script of each finite state automata comprises an FSA name, a parameter item, a control item and a function library shared by a plurality of finite state automata;
the FSA name: different finite state automata are distinguished through FSA names;
the parameter items are: configuring the current FSA operation related condition;
the control items: describe the different context conditions and the corresponding operations; each control item comprises a Context part and an Operation part, where Context describes the condition part of an FSA path and is composed of several items, and Operation describes the operation performed under the corresponding condition;
the function library: the name of a function library is declared in the form "NameSpace Name", where NameSpace is a reserved word, and the function bodies are defined as Lua scripts.
Further, the APIs that call external services and import the service results into the grid realize the application of parameter models, and include:
(1) CallService(Sentence, ServiceName): the API that calls a service; the called service is a parameterized model for structural analysis, which structures the text to be analyzed and returns the data of an initial language structure. The first parameter is the input to be passed to the service; the second parameter is the service name, which must be configured in the configuration file; the return value is the language structure returned by the service;
(2) AddStructure(Sentence_JSON): injects a language structure in JSON format into the grid.
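How such an injection might work can be sketched with a hypothetical Python mock; the real framework exposes CallService/AddStructure to Lua scripts, and the JSON schema used here is an assumption for illustration only:

```python
import json

# Hypothetical mock of the grid-injection API described above.
class Grid:
    def __init__(self, text=""):
        self.text = text         # text to be analyzed
        self.attributes = {}     # grid attributes, K = V pairs
        self.units = []          # grid cells carrying language units
        self.relations = []      # <HeadUnit, SubUnit, Relation> triples

def add_structure(grid, sentence_json):
    """Inject a language structure in JSON format into the grid:
    units become grid cells, relations become cell relationships."""
    data = json.loads(sentence_json)
    for u in data.get("units", []):
        grid.units.append({"unit": u["tu"], "attributes": u.get("au", {})})
    for head, sub, rel in data.get("relations", []):
        grid.relations.append((head, sub, rel))
    return grid

# e.g. injecting the initial structure a (mocked) service call returned
grid = add_structure(Grid("研究生命起源"),
                     json.dumps({"units": [{"tu": "研究", "au": {"POS": "v"}}],
                                 "relations": []}))
print(len(grid.units))  # 1
```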
The APIs for controlling the internal variables of the grid structure realize the addition, acquisition and testing of internal variables, improving the controllability and interpretability of language structure calculation. They include: (1) addition-class APIs for internal variables; (2) acquisition-class APIs for internal variables; and (3) test-class APIs for internal variables.
The APIs applying the data tables fall into three classes: function class, acquisition class and test class. The function-class APIs realize the interaction between the grid and the data tables, giving the framework the ability to apply explicit knowledge; they include:
1)Segment(TableName)
The function of Segment is to segment the text in the grid based on the data table and to add attributes;
2)SetLexicon(TableName)
The function of SetLexicon is to add the attributes of the data items in the data table to grid cells, providing application attribute information for the grid cells;
3)Relate(TableName)
The function of Relate is to import a relational data table by calling the name of the main table, the name of the relationship class and the name of the relationship;
When a relational data table is imported into the grid with the Relate function, a data item in the data table corresponds to a grid cell, and a relationship formed by a data item in the main table with a data item in a sub table corresponds to a cell relationship in the grid. The work done by the Relate function decomposes into the following steps:
(1) import the main-table data items satisfying the Limit condition, together with their attributes, into the grid; add an "ST-Unit" attribute to the grid whose value is TableName, marking in the grid the cells whose source is TableName;
(2) obtain all the sub tables corresponding to a main-table data item through the Coll attribute;
(3) add the sub-table data items satisfying the Limit condition to the grid;
(4) add the grid cell relationships, with the U-class and R-class attributes, to the grid cells, for the binary data table whose main table is TableName;
(5) add the attributes of the sub-table data item to the attributes of the grid cell relationship;
(6) add the attributes "URoot", "URootTableName", "RRoot", "RRootTableName" and "ST-Relation" to the grid; for each relationship <HeadUnit, SubUnit, Relation> between grid cells successfully added to the grid, the values of these attributes are drawn from HeadUnit, Relation and TableName respectively.
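The import steps above can be sketched in Python; this is a hedged simplification in which the table layout (plain dicts), the "Coll" lookup and the "ST-Unit" handling are my own assumptions, not the framework's actual storage format, and the grid bookkeeping of step (6) is omitted:

```python
# Hedged sketch of the Relate(TableName) import decomposition above.
def relate(grid, main_table, sub_tables, limit=lambda attrs: True):
    """Import a relational data table (main table plus sub tables):
    items become grid cells, and each main-item/sub-item pair becomes
    a <HeadUnit, SubUnit, Relation> cell relationship."""
    for item, attrs in main_table["items"].items():
        if not limit(attrs):                     # step (1): Limit condition
            continue
        cell_attrs = dict(attrs)
        cell_attrs["ST-Unit"] = main_table["name"]
        grid["units"].append({"unit": item, "attributes": cell_attrs})
        for sub_name in attrs.get("Coll", []):   # step (2): Coll attribute
            sub = sub_tables[sub_name]
            for sub_item, sub_attrs in sub["items"].items():
                if not limit(sub_attrs):         # step (3)
                    continue
                grid["units"].append({"unit": sub_item,
                                      "attributes": dict(sub_attrs)})
                # steps (4)-(5): cell relationship plus its attributes
                grid["relations"].append(
                    (item, sub_item, sub["relation"], dict(sub_attrs)))
    return grid

grid = {"units": [], "relations": []}
main = {"name": "V-Coll", "items": {"吃": {"Coll": ["V-Obj"]}}}
subs = {"V-Obj": {"relation": "Obj", "items": {"饭": {"POS": "n"}}}}
relate(grid, main, subs)
print(grid["relations"][0][:3])  # ('吃', '饭', 'Obj')
```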
4)Str=GetPrefix(TableName,String)
GetPrefix judges whether the string has some data item of the data table as a prefix; if so, it returns the longest matching string;
5)Str=GetSuffix(TableName,String)
GetSuffix judges whether the string has some data item of the data table as a suffix; if so, it returns the longest matching string.
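The longest-match behavior of GetPrefix/GetSuffix can be sketched as follows; the data table is modeled here as a plain set of data items, which is an assumption for illustration:

```python
# Hedged sketch of GetPrefix/GetSuffix: return the longest data item
# that is a prefix (or suffix) of the given string, "" when none is.
def get_prefix(table, string):
    best = ""
    for item in table:
        if string.startswith(item) and len(item) > len(best):
            best = item
    return best

def get_suffix(table, string):
    best = ""
    for item in table:
        if string.endswith(item) and len(item) > len(best):
            best = item
    return best

table = {"北京", "北京大学"}
print(get_prefix(table, "北京大学中文系"))  # 北京大学
```

Note that both "北京" and "北京大学" are prefix strings here; the longest one wins, matching the behavior described above.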
The APIs applying the finite state automata realize the interaction between the grid and the finite state automata, giving the framework the ability to identify and process contexts efficiently; this class of APIs includes RunFSA, GetFSANode and GetParam:
(1)RunFSA(FSAName(,Param))
The function of this API is to execute a finite state automaton, i.e. to match the Context in an FSA script against the grid and execute the corresponding Operation; FSAName is the name of the finite state automaton to execute, and Param is a parameter to pass in, which can be read inside the finite state automaton script through the API GetParam;
(2) No = GetFSANode(-1) or No1, No2 = GetFSANode("$Tag")
Executing a finite state automaton means matching the Context in the finite state automaton script against the grid; if the match succeeds, a connection from the Enter node to the Exit node is completed, at which moment one or more paths are connected. In an Operation script, the grid cells corresponding to the nodes of the Context can be accessed; to this end the attribute nodes on an FSA path are numbered in sequence, and grid cells are accessed by referring to these node numbers.
(3)Str=GetParam(Key)
Reads, inside the FSA script, the values of the parameters passed in by the RunFSA function.
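The interplay of RunFSA, GetFSANode and GetParam can be illustrated with a minimal Python sketch; the real Context/Operation script syntax is not reproduced, and modeling a Path as a list of predicates over successive grid cells is my own simplifying assumption:

```python
# Minimal sketch of the FSA APIs: an FSA is <Enter, Path, Operation,
# Exit>; matching the Path connects Enter to Exit, and the Operation
# can refer back to the matched cells by node number (cf. GetFSANode)
# and read the passed-in parameter (cf. GetParam).
def run_fsa(fsa, cells, param=None):
    """Match fsa["path"] against each window of grid cells, run the
    Operation on every success, and return the matched node numbers."""
    hits = []
    path = fsa["path"]
    for start in range(len(cells) - len(path) + 1):
        window = cells[start:start + len(path)]
        if all(pred(cell) for pred, cell in zip(path, window)):
            hits.append(list(range(start, start + len(path))))
            fsa["operation"](window, param)
    return hits

relations = []
fsa = {
    "name": "NumNoun",  # hypothetical FSA recognizing "numeral + noun"
    "path": [lambda c: c["POS"] == "m", lambda c: c["POS"] == "n"],
    "operation": lambda w, p: relations.append((w[0]["unit"], w[1]["unit"], p)),
}
cells = [{"unit": "三", "POS": "m"}, {"unit": "书", "POS": "n"}]
print(run_fsa(fsa, cells, param="Quant"))  # [[0, 1]]
```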
The analysis framework supporting natural language structure calculation is dominated by symbolic calculation, accepting symbolic input and producing symbolic output, and therefore has good controllability, extensibility and interpretability; during calculation it can schedule the various parameterized calculation models, fully exploiting the capability of parameter models; finally, the framework can introduce explicit knowledge, under whose guidance deep and fine-grained language structure calculation becomes possible.
Drawings
FIG. 1 is a schematic diagram of the analysis framework supporting natural language structure computation;
FIG. 2 is a language structure represented as a directed graph with attribute information;
fig. 3 is a finite state transition diagram into which finite state automata scripts are compiled.
Detailed Description
As shown in fig. 1, the analysis framework supporting natural language structure calculation includes three functional modules: a grid, a data table and a finite state automaton. Around these three modules a series of APIs (Application Programming Interfaces) is designed which, together with the Lua scripting language, are used to write the language structure calculation scripts, including a master control script and finite state automaton scripts. Finally, the framework provides an executable program that runs the language structure calculation scripts, thereby realizing language structure calculation.
1. Grid: a computing platform carrying various language structures. The complexity of natural language lies in a one-dimensional symbol sequence corresponding to two-dimensional formal and conceptual structures. During calculation, the computed structure is therefore subject to various ambiguities, including ambiguity of language unit boundaries and ambiguity of language concepts. The grid can simultaneously accommodate analysis structures from different levels, different algorithms and even different systems, letting them cooperate while remaining separable, jointly supporting the generation of complex target structures.
2. Data table: a component for representing and processing knowledge. The framework uses expert-system-style, knowledge-led master control to complete complex language structure analysis, including deep semantic analysis; this requires a formal representation of knowledge that also supports fast computation over massive knowledge. In this framework, data tables are adopted to meet these requirements.
3. Finite state automaton: a calculation control component for characterizing language contexts. Natural language has complex contexts and requires a control component that is simple, efficient and expressive. In scripting languages, different conditions are usually handled by branching logical control statements; this framework instead adopts finite state automata, which cooperate with the scripting language to jointly complete the control tasks of the code's functional structure.
The API system of the framework is designed as follows: (1) the framework provides APIs for calling external services and importing the service results into the grid, thereby realizing the scheduling of parameterized calculation models; (2) the framework designs APIs for controlling the internal variables of the grid structure, including adding, testing and obtaining internal variables, thereby realizing the controllability of the language structure; (3) the framework designs APIs for applying the data tables, i.e. realizing the interaction between data tables and the grid, thereby introducing explicit knowledge; (4) the framework designs APIs for applying the finite state automata, i.e. realizing the interaction between finite state automata and the grid, thereby extending the functions of the framework.
The executable program for the language structure calculation scripts is designed as follows: (1) the executable program has an indexing function, which indexes the data tables and finite state automata so that they can be used efficiently during execution; (2) the executable program has an execution function, which runs the language structure calculation scripts to realize language structure calculation; (3) the executable program executes the master control script; to allow the master control script to call data tables, finite state automata and external services, the framework defines a configuration file: the index file paths of the data tables and finite state automata, together with the IP addresses and interfaces of the external services, are written into the configuration file, which is passed to the executable program.
The three components in the framework and the API architecture are detailed below, as shown in fig. 1.
1. Grid
The following describes how a grid stores language structures, including: the definition of the grid as a data structure, the formal definition of a language structure, and the correspondence between grids and language structures.
1.1 definition of the grid
A grid, as a data structure, includes the following internal variables: the text to be analyzed, the grid attributes, the grid cells and the relationship among the grid cells. Wherein, the grid cell is also a data structure, comprising the following internal variables: the stored language units, the attributes of the grid units, and the features and scores of the attributes of the grid units; the relationship between grid cells is also a data structure that includes the following internal variables: attributes of relationships between grid cells, characteristics and scores of relationships between grid cells. The specific definitions of the above variables are as follows:
1) Text to be analyzed: i.e. the analysis object in the current analysis flow, the variable value is usually a sentence, a paragraph or a chapter.
2) Grid attributes: the attributes of the current grid, a set of one or more key-value pairs of the form "K = V", used to store information related to the other internal variables of the grid as well as log information produced during analysis, where "K" is the attribute name and "V" is the attribute value; an attribute value may be a single value or a set of values.
3) Grid cell: grid cells are the basic elements of a grid. A grid usually includes a plurality of grid cells arranged in matrix form; each grid cell has a unique cell number composed of a column number c and a row number r, written (c, r).
4) Grid cell attributes: the attributes of a grid cell describe its properties, each comprising an attribute name and an attribute value stored as a key-value pair, namely: Key = Value (abbreviated K = V). An attribute may have a single independent value or several mutually distinct values, i.e. a set of attribute values; with the symbol "[ ]" this is abbreviated as "K = [v1, v2]".
According to different attribute description contents, the attribute description contents can be divided into property attributes and relationship attributes. The property attribute is description of the property of the grid cell, such as the type, position and other information of the grid cell; a relationship attribute is a description of the relationship of the current grid cell to other grid cells.
5) Features and scores of grid cell attributes: an attribute value carries features, a set of one or more key-value pairs of the form "Feature = Score", used to store the features of the current attribute value and their scores.
6) Relationships between grid cells: a partially ordered binary relationship between two grid cells, in the form <HeadUnit, SubUnit, Relation>, meaning "grid cell HeadUnit points to grid cell SubUnit with the relationship Relation".
7) Attributes of relationships between grid cells: the attributes of a relationship between grid cells form a set of one or more key-value pairs of the form "K = V", used to store information related to the current relationship between grid cells, where "K" is the attribute name and "V" is the attribute value; an attribute value may be a single value or a set of values.
8) Features and scores of relationships between grid cells: the features of a relationship between grid cells form a set of one or more key-value pairs of the form "Feature = Score", used to store the features of the current relationship between grid cells and their scores, where "Feature" is the feature description and "Score" is the score of the feature.
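The internal variables above can be sketched as plain data structures; the following Python illustration is hedged, with field names of my own choosing rather than the framework's actual ones:

```python
from dataclasses import dataclass, field

# Hedged sketch of the grid's internal variables (section 1.1).
@dataclass
class GridCell:
    unit: str                                      # stored language unit
    attributes: dict = field(default_factory=dict) # K = V pairs
    features: dict = field(default_factory=dict)   # Feature = Score pairs

@dataclass
class CellRelation:
    head: GridCell                                 # HeadUnit
    sub: GridCell                                  # SubUnit
    relation: str                                  # Relation type
    attributes: dict = field(default_factory=dict)
    features: dict = field(default_factory=dict)

@dataclass
class Grid:
    text: str                                      # text to be analyzed
    attributes: dict = field(default_factory=dict)
    cells: list = field(default_factory=list)
    relations: list = field(default_factory=list)

g = Grid(text="张三吃饭")
head = GridCell("吃", {"POS": "v"})
sub = GridCell("饭", {"POS": "n"})
g.cells += [head, sub]
g.relations.append(CellRelation(head, sub, "Obj"))
print(len(g.relations))  # 1
```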
1.2 formalized definition of language constructs
In this framework, a language structure is treated from the viewpoint of units, relationships and attributes, i.e. it is represented as a directed graph with attribute information, as shown in fig. 2. Nodes u1 and u2 in fig. 2 represent language units; the directed edge r, with u2 as its starting point and u1 as its end point, represents the relationship between the two language units. Both the attributes of language units (on the nodes) and the attributes of the relationships between units (on the edges) can be represented by key-value pairs of the form K = V.
The language structure is a directed graph. Following the definition of a graph as a two-tuple, it is formally defined as: G = (U, R), wherein:
1) U is the set of nodes of the graph, here a finite set of language units, each noted: u = {tu, au}, where tu ∈ Lex, Lex is the set of language unit strings (tokens), and au is the attribute of the language unit, a set of key-value pairs noted au = {K = V}, K being an attribute name and V an attribute value.
2) R is a finite set of edges connecting two different nodes in U, i.e. a set of relationships between two language units. Since relationships between language units are generally not symmetric, each edge is directed and is noted: (u_i, u_j, r, ar), where u_i ∈ U, u_j ∈ U, r ∈ RT, and RT is the set of relationship types; ar is the attribute of the language unit relationship, a set of key-value pairs noted ar = {K = V}, K being an attribute name and V an attribute value.
1.3 correspondence of grids to language constructs
Any language structure, including grammar structure, semantic structure and pragmatic structure, can be formally expressed by adopting the viewpoint of 'language unit, relation and attribute' and loaded into the grid, so that the grid has the capability of bearing multi-source and multi-type language structures. The bearing mode is as follows: the language units are borne by the grid units, and each language unit corresponds to one grid unit; relationships between language units are carried by relationships between grid units; the attributes of the language units are carried by the attributes of the grid units; the attributes of the relationships between language units are carried by the attributes of the relationships between grid units.
2. Data sheet
Data tables are the packaging components of explicit knowledge, which includes unary-class knowledge and binary-class knowledge. Unary knowledge is descriptive knowledge that describes the properties of a language unit in terms of pinyin, part of speech, translation, and the like. Binary knowledge is relationship class knowledge that describes the type of relationship between language units.
In the framework, the data table is used for packaging the unary knowledge and the binary knowledge of the symbol type, and important information sources are provided for constructing grid cells, establishing relations, setting attributes and the like during language structure calculation.
2.1 formalized definition of data tables
The data table may be defined as a set of triples, namely: tableName = { < Item, attribute, condition > }, wherein:
TableName: a table name;
item: the data item may be in a string form (Word) or a key-value expression (KV). When the Word is Word, the Word corresponds to a language unit, and when the grid is calculated, the Word corresponds to a grid unit; and when the data item is KV, the data item corresponds to all grid cells which meet the condition that KV is true in the grid.
Attribute: the property of Item, which is a set of key-value pairs, i.e., { K = V }, is usually brought into the property of the grid cell or into the property of the relationship between cells.
Condition: and the definition condition of the application Item is a set of key value expressions, namely { KV }, and when the grid unit corresponding to the current Item meets one of the { KV }, the next operation is continued, including adding grid units, attributing or establishing a relationship and the like.
It should be noted that this framework adopts the following definition for key-value expressions (KV): a key-value pair K=V describes an attribute; in a computing scenario it can also be tested, returning logical "true" or "false" according to whether the object under test carries the attribute K=V. A "K=V" used for testing is called a key-value expression, abbreviated KV. Logical combinations of several key-value expressions are also key-value expressions; the logical operators are and "&", not "!", and or "[ ]".
2.2 Format definition of data tables
In the framework, one or more data tables are contained and stored in one or more files, and the format of the data tables is defined as follows:
Table TableName
#Global{K=V}Limit=[{KV}]
Word{K=V}Limit=[{KV}]
KV{K=V}Limit=[{KV}]
wherein:
Line 1: Table is a reserved word; TableName is the globally unique name of the data table, written after the reserved word "Table" in the first line. Applications reference the data table by this name.
Line 2: "#Global" is a reserved word, followed by key-value pairs and/or a key-value expression introduced by Limit. The entry is optional; if present, its content is shared by all entries of the data table.
Line 3: usage of an Item in string form (Word); {K=V} is a set of key-value pairs, and "Limit" is a reserved word indicating the restrictive condition for applying the entry; one or more key-value expressions may be written inside the logical or "[ ]".
Line 4: usage of an Item in key-value-expression form (KV); {K=V} is a set of key-value pairs, and "Limit" is a reserved word indicating the restrictive condition for applying the entry; one or more key-value expressions may be written inside the logical or "[ ]".
2.3 types of data tables
The data tables are divided into two types: descriptive data tables, similar to a dictionary, and relational data tables, which facilitate constructing relations between two language units.
In a descriptive data table, the described objects are independent language units; the table gives knowledge such as the form and the attributes of each language unit, and is mainly used for determining language units and setting their attributes.
In a relational data table, by contrast, the depicted object involves two language units: a central language unit and a language unit having some relationship with it. The binary relation is generally a partial-order relation, either grammatical or semantic. A relational data table typically encapsulates a binary relation with multiple data tables, usually designed as one master table and several slave tables. The master table stores the list of central language units, and each slave table stores a list of language units having a certain relationship with the central language units.
The master table needs attributes to specify the slave tables that may form a specific relation with the current data item. Specifically, a relation name is given in the Coll attribute of the master table, and the slave tables capable of forming that relation with the current data item are specified by an attribute of the form "Coll-RelationName=[SlaveTable1 SlaveTable2]".
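A minimal sketch of the master/slave lookup through the Coll attribute; the table names, items and the relation name are hypothetical:

```python
# Hypothetical master/slave tables: the master's "Coll-<relation>" attribute
# names the slave tables whose items may form that relation with the entry.

master = {"吃": {"Coll-OBJ": ["Food_List", "Meal_List"]}}   # head unit "eat"
slaves = {"Food_List": ["苹果", "面条"], "Meal_List": ["晚饭"]}

def related_units(head, relation):
    tables = master.get(head, {}).get("Coll-" + relation, [])
    return [unit for t in tables for unit in slaves.get(t, [])]

print(related_units("吃", "OBJ"))   # ['苹果', '面条', '晚饭']
```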
3. Finite state automaton
As the control component, the finite state automaton can express and execute contexts in language structure analysis efficiently. Automata are written in the finite-state-automaton script grammar, which describes context conditions clearly and logically. At the application stage, a compilation tool converts an automaton described in the script grammar into an internal node-connection graph; each node tests the state of a grid cell with a key-value expression (KV), and when the expression is true the automaton moves to the next node.
In the invention, one or more finite state automata can be designed according to the requirement, each finite state automata is independent, and the finite state automata is called by an upper-layer script to realize the preset function.
3.1 definition of finite State automata
In the present invention, a finite state automaton is defined as a set of quadruplets, namely: FSAName = { < Enter, path, operation, exit > }, wherein:
FSAName: the globally unique name of the finite state automaton.
Enter: the entry node; a finite state automaton has a unique entry.
Path: the path of the automaton, representing context information during structure computation. It is an ordered sequence of attribute test nodes (Nodes); in the invention, each attribute test node stores a key-value expression (KV) used for the attribute testing of grid cells.
Operation: the operation node; the action to execute when the context corresponding to the Path is matched successfully.
Exit: the exit node; a finite state automaton has a unique exit.
3.2 written grammar for finite state automata
In the present invention, one script may correspond to one or more finite state automata. The script of each automaton comprises three parts (the FSA name, the parameter items and the control items) plus a function library part shared by several automata; each part is described below with reference to the following script.
1. FSA FSAName
2. #Entry EntryNode=[KV]
3. #Parameter Order=Yes MaxLen=Yes Nearby=Yes Bound=Clause
4. #Include CodeLib
5.
6. Context1
7. {
8. Operation1
9. }
10.
11. ...
12. Item1 Item2 SubName
13. {
14. Operationn
15. }
16.
17. sub SubName
18. (
19. Context
20. )
21.
22. NameSpace CodeLib
23. function FuncName1()
24. ...
25. end
26.
27. function FuncName2()
28. ...
29. end
1) Name of FSA
The name of the finite state automaton is user-defined and written after the reserved word "FSA". Different automata are distinguished by this name, which is therefore globally unique, as in line 1 of the script.
2) Parameter item
The parameter items configure the running conditions of the current FSA, as in lines 2-4 of the script.
The condition that the pre-positioned (entry) node of the current automaton must satisfy is declared in the form "#Entry EntryNode=[KV]", where "#Entry" is a reserved word, EntryNode is user-defined, and KV is a key-value expression over the attributes of the pre-positioned node;
The matching conditions of the current automaton are declared in the form "#Parameter K=V", where "#Parameter" is a reserved word and K=V mainly includes the following four types:
Order=Yes/No: sets whether the node order in the automaton must agree with the word order; the default is Yes.
MaxLen=Yes/No: sets the execution mode of the automaton; when Yes, only the operation of the longest matching path is executed; when No, the operations of all matching paths are executed; the default is No.
Nearby=Yes/No: sets whether two adjacent nodes of the automaton must correspond to adjacent cells in the grid; the default is No.
Bound=Sent/Clause/Group/Chunk: sets the matching range of the automaton; the default is Sent.
The function library called by the current FSA is declared in the form "#Include ModuleName", where "#Include" is a reserved word and ModuleName is the name of a function library, which may be defined in the current file or in another file.
3) Control item
The control items are the main component of the FSA; they describe different context conditions and the corresponding operations, and each consists of two parts, Context and Operation, as in lines 6-15 of the script. Context describes the condition part of an FSA path and consists of several items; Operation describes the operation under the corresponding condition.
When a Context is too complex, or some part of it is highly reusable, that part can be encapsulated as a sub-context in the form "sub SubName ( Context )" and then referenced by its name SubName inside a Context, as in lines 12-20 of the script. Sub-contexts can be nested; with only one level, their position in the script is unrestricted, but when one sub-context calls another, the callee must be placed first.
4) Function library
A function library is defined in the form "NameSpace Name", with the function bodies written as Lua script; "NameSpace" is a reserved word, as in lines 22-29 of the script. To use a function library in the current FSA it must be declared in the parameter item "#Include", as in line 4 of the script, after which its functions can be called in an Operation.
The API architecture in this framework is next described.
1. APIs that call external services and import the service results into the grid. These APIs realize the application of parameter models within the framework and include the following:
1.1 CallService(Sentence, ServiceName): the called service is generally a parameterized model for structure analysis; its main function is to structure the text to be analyzed and return the data of an initial language structure. The first parameter is the input to pass to the service; the second is the service name, which must be configured in the configuration file; the return value is the language structure returned by the service.
1.2 AddStructure(Sentence_JSON): injects a language structure in JSON format into the grid. On injection, the framework checks whether the text content is consistent with the text already in the grid; if so, the structure is superimposed in the grid, otherwise a new analysis grid is started.
The format of Sentence_JSON is defined as follows:
{"Type": "", "Units": [String/Tree], "POS": [String], "Groups": [{"HeadID": int, "Group": [{"Role": String, "SubID": int}]}]}, where the fields are as follows:
The key names Type, Units, POS, Groups, HeadID, Group, Role and SubID are reserved words.
1)Type:
When "Sent", a sentence is represented, and Units is the raw, unannotated text.
When "Word", words are represented, and Units is a word sequence.
When "Chunk", chunks are represented, and Units is a chunk sequence.
When "Tree", a tree structure is represented, and Units is a tree in bracketed form.
The default is "Chunk". The content of Type determines the cell type when Units is imported into the grid.
2)Units:
Depending on the Type content, Units may be a sentence, a word sequence, a tree structure or a chunk sequence. When Type is Chunk, each unit in Units may be a chunk, or the word sequence forming the chunk in the form "Word/(KV KV) Word/(KV) ...", in which the words of one unit are concatenated to form the content of the current chunk.
3)POS:
The POS sequence carries the attribute information corresponding to Units; when a language unit in the Units sequence should not be added to the grid, the corresponding element of the POS sequence may be set to "None".
4)Groups:
Represents dependency structure information; for non-dependency structure representations this field may be omitted.
5)HeadID:
Information on the head (depended-on) node; its value is the index in Units (numbered from 0) of the unit corresponding to the depended-on node.
6)Group:
Represents the dependent-node information.
7)Role:
Role of dependent node
8)SubID:
Its value is the index in Units (numbered from 0) of the unit corresponding to the dependent node.
The effect produced by this API is:
1) The language units in Units are added to the grid as grid cells; 2) according to the value of Type, the Type attribute is added to the attributes of all these grid cells; 3) the part-of-speech information in POS is added to the attributes of the corresponding grid cells; 4) the relation information between language units in Groups is added to the grid-cell attributes, the relations between grid cells, and the grid attributes.
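The four effects above can be sketched in Python using the JSON field names defined earlier; the plain dict/list grid representation is illustrative only, not the framework's implementation:

```python
# Hedged sketch of AddStructure's described effects: Units become grid cells,
# Type/POS become cell attributes, and Groups become cell relations.

import json

def add_structure(cells, relations, sentence_json):
    data = json.loads(sentence_json)
    for i, (unit, pos) in enumerate(zip(data["Units"], data["POS"])):
        if pos != "None":   # "None" marks units that are not added
            cells[i] = {"Word": unit, "Type": data["Type"], "POS": pos}
    for dep in data.get("Groups", []):
        for g in dep["Group"]:
            relations.append((dep["HeadID"], g["SubID"], g["Role"]))

cells, relations = {}, []
doc = ('{"Type": "Word", "Units": ["李明", "进球"], "POS": ["NP", "VP"], '
       '"Groups": [{"HeadID": 1, "Group": [{"Role": "SBJ", "SubID": 0}]}]}')
add_structure(cells, relations, doc)
print(relations)   # [(1, 0, 'SBJ')]
```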
2. API for controlling internal variables of a grid structure
The API realizes the addition, acquisition and test of the internal variables of the grid structure, and improves the controllability and interpretability of the language structure calculation. The method specifically comprises the following API:
2.1 Add class API for internal variables
2.2 obtaining class API for internal variables
2.3 test class API for internal variables
3. APIs for applying data tables
APIs for applying data tables can be divided into three categories: functional, acquisition and test, as shown in the following table:
The functional APIs implement the interaction between the grid and the data tables, giving the framework the ability to apply explicit knowledge; they are introduced below.
1)Segment(TableName)
The function of Segment is to segment the text in the grid according to a data table and to add attributes.
The specific functions are as follows:
(1) In the current grid, perform full segmentation from left to right according to the data items of the data table.
(2) Add the attributes in the data table to the attributes of the corresponding grid cells.
(3) Add the attributes "Type" and "ST" to each segmented grid cell, where the value of "Type" is "Word" and the value of "ST" is TableName.
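The segmentation steps can be sketched as longest-match scanning over data-table items; the table content here is hypothetical, and the real API operates on grid cells rather than a plain string:

```python
# Longest-match, left-to-right segmentation against data-table items.
# Each cut span would become a grid cell carrying the item's attributes
# plus Type=Word (and ST=TableName in the framework).

table = {"下半场": {"Tag": "Time"}, "分钟": {"Entry": "Minute"},
         "第": {"Entry": "Order"}}

def segment(text, items):
    cuts, i = [], 0
    while i < len(text):
        hit = max((w for w in items if text.startswith(w, i)),
                  key=len, default=None)
        if hit:
            cuts.append((hit, {**items[hit], "Type": "Word"}))
            i += len(hit)
        else:
            i += 1          # characters not in the table are passed over
    return cuts

print([w for w, _ in segment("下半场38分钟", table)])   # ['下半场', '分钟']
```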
2)SetLexicon(TableName)
The function of SetLexicon is to add the attributes of a data table's items to grid cells, providing them with applicable attribute information. During grid computation, once the data table has been registered via SetLexicon, whenever a new grid cell is generated the framework automatically checks whether the cell's Word is in the data table, or whether the cell satisfies a key-value expression in the data table; if so, the attributes are imported, i.e., the attributes under the corresponding data item are added to the attributes of the grid cell.
3)Relate(TableName)
The function of Relate is to import a relational data table through its master table name, relation class names and relation names.
When a relational data table is imported into the grid with the Relate function, a data item of the table corresponds to a grid cell, and a relation formed between a master-table item and a slave-table item corresponds to a cell relation in the grid.
The function implemented by Relate can be decomposed into the following steps:
(1) Import the master-table data items satisfying the Limit condition, together with their attributes, into the grid; add an "ST-Unit" attribute to the grid whose value is TableName, indicating that the grid contains cells originating from the data table TableName.
(2) Obtain, through the Coll attribute, all the slave tables corresponding to the master-table data items.
(3) Add the slave-table data items satisfying the Limit condition to the grid;
(4) Add the cell relations, the U-type attributes and the R-type attributes to the grid cells. In a binary data table with TableName as master table, for each relation <HeadUnit, SubUnit, Relation> between cells successfully added to the grid, HeadUnit and SubUnit respectively obtain the following attributes:
Besides adding the above relation attributes to the grid cells, relation attributes carrying the relation source are additionally added: the ST (here, TableName) is appended after the Head/Sub of the binary relation attributes to distinguish relation sources; for example, USub extends to USubTableName, and USub-Relation to USubTableName-Relation, as shown below.
Properties | Extension |
USub=SubUnit | USubTableName=SubUnit |
USub-Relation=SubUnit | USubTableName-Relation=SubUnit |
RSub=Relation | RSubTableName=Relation |
UHead=HeadUnit | UHeadTableName=HeadUnit |
UHead-Relation=HeadUnit | UHeadTableName-Relation=HeadUnit |
RHead=Relation | RHeadTableName=Relation |
(5) Adding the attributes of the slave table data items to the attributes of the grid cell relationships;
(6) Add the attributes "URoot", "URootTableName", "RRoot", "RRootTableName" and "ST-Relation" to the grid: for each relation <HeadUnit, SubUnit, Relation> between grid cells successfully added to the grid, the values of these attributes are HeadUnit, HeadUnit, Relation, Relation and TableName respectively.
4)Str=GetPrefix(TableName,String)
GetPrefix determines whether the string has some data item of the data table as a prefix; if so, the longest matching string is returned.
5)Str=GetSuffix(TableName,String)
GetSuffix determines whether the string has some data item of the data table as a suffix; if so, the longest matching string is returned.
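A minimal sketch of the longest prefix/suffix match described for GetPrefix and GetSuffix; the item list is invented, and returning None on no match is an assumption:

```python
# Longest prefix/suffix match over data-table items (illustrative only).

items = ["第", "第一", "上半场"]

def get_prefix(table, s):
    hits = [w for w in table if s.startswith(w)]
    return max(hits, key=len) if hits else None   # longest match wins

def get_suffix(table, s):
    hits = [w for w in table if s.endswith(w)]
    return max(hits, key=len) if hits else None

print(get_prefix(items, "第一名"))    # 第一
print(get_suffix(items, "在上半场"))  # 上半场
```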
4. API for applying finite state automata
These APIs implement the interaction between the grid and the finite state automata, giving the framework the ability to recognize and process contexts efficiently. They include RunFSA, GetFSANode and GetParam, described in detail below.
4.1RunFSA(FSAName(,Param))
The function of this API is to execute a finite state automaton, i.e., to match the Contexts of the FSA script against the grid and execute the corresponding Operations. FSAName is the name of the automaton to execute, and Param is an optional parameter to pass in, which can be read inside the automaton script through the API GetParam.
Before execution, the finite-state-automaton script must be compiled into a finite state transition graph; this process is grammar compilation. During compilation, each control item of the FSA script corresponds to one or more paths of the FSA. A control item comprises a Context and an Operation: each item of the Context corresponds to an attribute test node on an FSA path, and the Operation corresponds to an operation node on the path.
The following script:
FSA Example
#Include Lib
#Parameter Nearby=Yes MaxLen=No Order=Yes
[(K=V&K=V +String)
(K=V|K=V String Unit:K=V)]
{Process1()}
[(K=V Tab_C)
SubName]
{Process2()}
sub SubName
([Tab_A Tab_B])
NameSpace Lib
function Process1()
print("process1")
end
function Process2()
print("process2")
end
compilation into a finite state transition diagram is shown in fig. 3.
After compilation into the finite state transition graph, the graph's nodes comprise an entry node, an exit node, attribute test nodes and operation nodes. As shown in fig. 3, there are an Enter node and an Exit node; the node before Exit containing the Process() content is an operation node, and the remaining nodes are attribute test nodes, whose key-value expressions determine whether the corresponding grid cells satisfy the conditions.
The FSA paths specify contexts: each path represents a set of context constraints, and when the constraints are satisfied the content of the corresponding operation node is executed. As shown in fig. 3, an FSA has multiple possible paths from Enter to Exit, corresponding to multiple sets of context constraints.
When running RunFSA, the following steps are specifically executed:
(1) Obtain the corresponding pre-positioned cell according to the #Entry parameter in the FSA script.
(2) Read the #Parameter settings in the FSA script.
(3) Perform bidirectional matching starting from the pre-positioned node, according to the parameters, and obtain the successfully matched paths.
(4) Select which path's operation to execute according to the MaxLen parameter set in the script.
4.2 No=GetFSANode(-1) or No1,No2=GetFSANode("$Tag")
Executing a finite state automaton means matching the Contexts of the automaton script against the grid; a successful match completes a passage from the Enter node to the Exit node, along one or more paths. Since Operation scripts may access the grid cells corresponding to the Context nodes, the attribute nodes on an FSA path are numbered sequentially, and grid cells are accessed by referring to node numbers. There are two numbering strategies:
1. From left to right: 0, 1, 2, ..., n-1;
2. From right to left: -1, -2, ..., -n.
When a cell set satisfying a complete FSA path is found in the grid, the FSA path and the grid cells corresponding to its nodes are determined, and the cells can be accessed by node number. The specific function of this API is to obtain FSA path numbers: with parameter -1 it returns the number of nodes in the current path; with parameter "$Tag" it returns the path numbers corresponding to the nodes labeled Tag in the Context, No1 and No2 being the start and end numbers of the Tag span respectively.
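The two numbering strategies mirror Python's positive and negative list indexing, which the following sketch uses to show that both schemes address the same matched cells (the matched sequence is invented):

```python
# For a path of n nodes, left-to-right index i and right-to-left index i-n
# address the same matched grid cell, exactly like Python list indexing.

matched = ["第", "1", "个"]           # cells matched along one FSA path
n = len(matched)
for left in range(n):
    assert matched[left] == matched[left - n]   # 0↔-3, 1↔-2, 2↔-1
print(matched[0], matched[-1])        # 第 个
```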
4.3Str=GetParam(Key)
Reads, inside the FSA, the value of a parameter passed in by the RunFSA function. For example, after RunFSA(FSAName, "Key=Value") is executed, GetParam("Key") can be run in an Operation of the FSAName script to obtain the value "Value".
In one embodiment, the language structure calculation using the framework includes the following steps:
S1: analyze the problem, and determine from it the required parameter models (implicit knowledge) and explicit knowledge;
S2: train the required parameter models according to the analysis result of S1, and collect the required explicit knowledge;
S3: expose the parameter models as services and configure them in the framework's configuration file, i.e., write each service's IP address and port number into the configuration file; optionally, batch-preprocess the text to be analyzed with a parameter model and take the resulting initial language structure as the framework's input;
s4: packaging the collected explicit knowledge in a data table file;
s5: writing a master control script and a finite state automata script by using a Lua script language and an API provided by the framework;
s6: indexing the data table file and the finite state automata file by using an indexing function of the executable program;
s7: writing the index paths of the data table file and the finite state automata file into a configuration file;
S8: execute the language structure calculation script using the execution function of the executable program; the master control script and the configuration file must be passed in.
The following description will be made of a specific embodiment of the present framework, taking phrase recognition as an example.
S1: here, phrase recognition means recognizing scores, ordinals and game times in sports news text; determine the parameter model and explicit knowledge needed.
1) Observe the three categories of phrases: scores take forms such as "3:2", "3-2", "3 to 2" and "three to two"; ordinals take forms such as "the 1st" and "the 23rd"; game times take forms such as "first half, 34 minutes" and "34 minutes".
2) Determine the needed parameter model: the target phrases usually lie within a chunk and do not cross chunk boundaries, so a chunk sequence labeling model is introduced as an auxiliary constraint for phrase recognition.
3) Determine the needed explicit knowledge. In score-phrase recognition the key features are numbers and characters meaning "ratio": the numbers include Chinese and Arabic numerals, and the "ratio" characters include "比", "-" and ":", so these words and characters need corresponding attribute descriptions. In ordinal-phrase recognition the key features are numbers and the order prefix "第"; in game-time-phrase recognition the key features are the time words "上半场" (first half), "下半场" (second half) and "分钟" (minute), together with numbers in the game scene, so corresponding attribute descriptions are likewise needed for these words and characters.
S2: training a required parameter model according to the analysis result of the S1, and collecting required explicit knowledge, specifically:
Train the chunk sequence labeling model, including preparing training data, designing the model and carrying out training. The training data are chunk dependency annotations, and the model is a BERT+CRF sequence labeling model.
Collect the required explicit knowledge: the characters meaning "ratio" ("比", "-", ":"), the order prefix "第", the time words "上半场", "下半场" and "分钟", and the Chinese and Arabic numerals.
S3: opening the parameter model as a service, and configuring in a configuration file of the framework, namely writing an IP address and a port number of the service into the configuration file, specifically:
1) Expose the chunk sequence labeling model as a service;
2) Configure the service in the configuration file (a .txt file); the configuration format is as follows:
Server:{"name":"chunk","IP":"127.0.0.0","Port":8080}
Here "Server", "name", "IP" and "Port" are reserved words: "Server" indicates that the line is a configuration item for an external service, and "name", "IP" and "Port" give the external service's name, IP address and port number respectively.
S4: tab's of the data table file.
1) The characters meaning "ratio" ("比", "-", ":"), the order prefix "第", and the time words "上半场" (first half), "下半场" (second half) and "分钟" (minute) are packaged in a descriptive data table named "Merge_Dict"; these language units of different meanings are treated as individual data items, and the specific meaning of each is given as an attribute of the corresponding data item in the form "K=V";
2) All numbers, including Chinese and Arabic numerals, are packaged in a descriptive data table named "Num_List". The content of the finally formed data table file merge. is as follows:
Table Merge_Dict
- Entry=Score
: Entry=Score
比 Entry=Score
第 Entry=Order
上半场 Tag=Time
下半场 Tag=Time
分钟 Entry=Minute
Table Num_List
0
1
2
3
4
5
6
7
8
9
零
一
二
三
四
五
六
七
八
九
十
S5: write the master control script and the finite-state-automaton script using the Lua script language and the APIs provided by the framework.
1) The finite-state-automaton script implements the feature description of the target phrases and the further operations once the features are satisfied. The phrase recognition task involves the feature descriptions of score, ordinal and game-time phrases and the corresponding operations; the features and operations of the three phrase types are each expressed by a Context-Operation pair, written in a finite state automaton named Merge and stored in the finite-state-automaton file Merge. The script named Merge contains the parameter settings, the pre-positioned node settings and the three Context-Operation pairs.
1.1) Parameter settings. The parameter settings determine the matching mode and the execution mode. In this task the matching mode is: adjacent matching, sequential matching, and all matched nodes within the same clause, i.e., Order=Yes Nearby=Yes Bound=Clause; the execution mode is: execute only the longest match, i.e., MaxLen=Yes.
1.2) Pre-positioned node settings. The pre-positioned node determines the starting point of matching. For score, ordinal and game-time phrases the pre-positioned nodes are, respectively: words with score meaning, the order prefix, and the word for minutes; in the pre-positioned node parameters they are given as key-value expressions, namely: EntryScore=[Entry=Score], EntryOrder=[Entry=Order], EntryTime=[Entry=Minute].
1.3) Writing the Context-Operation pair for score phrases. Their recognition feature can be described as: one or more numbers, a word with score meaning, and one or more numbers in succession; the corresponding Context is written: +Num_List EntryScore +Num_List. The subsequent operation: merge all matched nodes end to end, add the result as a grid cell, and add the attribute "Tag=MatchScore" to it.
1.4) Writing the Context-Operation pair for ordinal phrases. Their recognition feature: an order prefix followed by one or more numbers; the corresponding Context: EntryOrder +Num_List. The subsequent operation: merge all matched nodes end to end, add the result as a grid cell, and add the attribute "Tag=MatchOrder" to it.
1.5) Writing the Context-Operation pair for game-time phrases. Their recognition feature: an optional time word for the first or second half, an optional order prefix "第", one or more numbers, and the word for minutes in succession, where the half-field time word and the prefix may each appear or not; the corresponding Context is written: ?[Tag=Time] ?[第] +Num_List EntryTime. The subsequent operation: merge all matched nodes end to end, add the result as a grid cell, and add the attribute "Tag=MatchTime" to it.
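As a rough cross-check of the three feature descriptions, the following sketch renders them as regular expressions over raw text. This is my own paraphrase for illustration, not the framework's Context notation, which matches grid cells rather than characters:

```python
# Regex paraphrase of the three phrase features: numbers + score word +
# numbers; order prefix + numbers; optional half word, optional 第,
# numbers, minute word.

import re

NUM = "[0-9零一二三四五六七八九十]+"
patterns = {
    "MatchScore": re.compile(f"{NUM}[比:-]{NUM}"),
    "MatchOrder": re.compile(f"第{NUM}"),
    "MatchTime":  re.compile(f"(?:上半场|下半场)?第?{NUM}分钟"),
}

def tag_phrases(text):
    return [(tag, m.group()) for tag, p in patterns.items()
            for m in p.finditer(text)]

print(tag_phrases("下半场38分钟，李明打进第1个球，比分拉平为1-1。"))
```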
2) The Lua script controls the analysis process; in this example task, the procedure is as follows:
2.1) Use the API SetText to input the text to be analyzed into the grid; each character of the text corresponds to a grid cell, which has a cell number attribute (Unit), a type attribute (Type), starting and ending column number attributes (From, To), a cell content attribute (Word), a core content attribute (HeadWord) and a clause number attribute (ClauseID). For example, when the input text is "38 minutes into the second half, Li Ming scored the 1st goal, successfully levelling the score at 1-1.", 30 grid cells are formed in row 1 of the grid; the grid cell for the first character "下" ("lower") is numbered "(0,1)", indicating that it occupies column 0 of row 1, and it has the attributes: Unit=(0,1), Type=Char, Char=HZ, From=0, To=0, Word=下, HeadWord=下, ClauseID=0.
2.2 Use the API CallService to call the chunk sequence annotation service on the text to be analyzed, storing the result in the variable ChunkRet. For the input text "38 minutes of the bottom half, Li Ming hits the 1st ball, successfully pulling the score level to 1-1.", the returned result is:
{"Type":"Chunk","Units":["38 minutes of the bottom half",",","Li Ming","hits","the 1st ball",",","successfully","the score","pulls level to","1-1","."],"POS":["NULL-MOD","w","NP-SBJ","VP-PRD","NP-OBJ","w","NULL-MOD","NULL-MOD","VP-PRD","NP-OBJ","w"],"ST":"Chunk"}
2.3 Use the API AddStructure to import the result returned by the chunk sequence annotation service into the grid. When the result from 2.2) is imported, the chunks in Units are added to the grid as grid cells; the cells "," and "." already exist in the grid and are therefore not added again. The annotation sequence corresponding to the chunk sequence, i.e. the content of "POS", is added to each corresponding grid cell as its part-of-speech attribute (POS); the contents of "Type" and "ST" are added to all corresponding grid cells as the type attribute (Type) and source attribute (ST); the remaining attributes, including the cell number attribute (Unit), the starting column number (From) and terminating column number (To) attributes, the cell content attribute (Word), the core content attribute (HeadWord), and the clause number attribute (ClauseID), are added to each newly created grid cell. For example, the grid cell corresponding to "Li Ming" has the following attributes: POS=NP-SBJ, Type=Chunk, ST=Chunk, Unit=(10,1), From=9, To=10, Word=Li Ming, HeadWord=Li Ming, ClauseID=1.
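A minimal sketch of the import step in 2.3, assuming a grid keyed by cell content: chunks already present are reused rather than re-added, and the POS, Type, and ST attributes are attached to each corresponding cell. The function and data layout are assumptions, not the patent's implementation.

```python
# Illustrative sketch of AddStructure: import a chunk-annotation result
# into the grid, reusing cells that already exist and attaching the
# POS / Type / ST attributes to each corresponding cell.
def add_structure(grid, result):
    """grid: dict mapping cell content -> attribute dict; result: service JSON."""
    for word, pos in zip(result["Units"], result["POS"]):
        cell = grid.setdefault(word, {})    # reuse an existing cell if any
        cell.update({"POS": pos, "Type": result["Type"], "ST": result["ST"]})
    return grid

grid = {",": {"Type": "Char"}}              # punctuation already in the grid
ret = {"Type": "Chunk", "ST": "Chunk",
       "Units": ["Li Ming", ","], "POS": ["NP-SBJ", "w"]}
add_structure(grid, ret)
print(grid["Li Ming"]["POS"])               # NP-SBJ
```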
2.4 Use the API Segment to apply the data table Merge_Dict, performing left-to-right maximum-length matching on the current text to be analyzed. When a data item from Merge_Dict is present in the current grid structure and does not cross a grid cell boundary, it is cut out to form a new grid cell, and the corresponding attributes from the data table are added to that cell. In this example, after applying Merge_Dict, "bottom half", "minute", "th", and "-" are each cut out as new grid cells, receiving the attributes Tag=Time, Entry=Time, Entry=Order, and Entry=Score respectively.
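The left-to-right maximum-length matching performed by Segment is a standard greedy dictionary scan. The sketch below uses English stand-ins for the example's data items and attaches the data table attributes to each cut; all names are illustrative, not the patent's API.

```python
# Minimal sketch of left-to-right maximum-length dictionary segmentation,
# in the spirit of the Segment API. Entries map to the attributes that the
# data table attaches to each newly cut grid cell.
MERGE_DICT = {
    "bottom half": {"Tag": "Time"},
    "minute":      {"Entry": "Time"},
    "th":          {"Entry": "Order"},
    "-":           {"Entry": "Score"},
}

def segment(text, table):
    """Greedy longest-match scan; returns (substring, attrs-or-None) pairs."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try longest candidate first
            if text[i:j] in table:
                out.append((text[i:j], table[text[i:j]]))
                i = j
                break
        else:
            out.append((text[i], None))          # no entry: single character
            i += 1
    return out

print(segment("1-1", MERGE_DICT))
# [('1', None), ('-', {'Entry': 'Score'}), ('1', None)]
```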
2.5 Use the API RunFSA to apply the finite state automaton Merge and add the corresponding attributes to the recognized target phrases. In this step, "38 minutes of the lower half", "the 1st", and "1-1" match the Contexts in the finite state automaton Merge: "?Tag=Time? [th]+Num_List EntryTime", "EntryOrder+Num_List", and "Num_List EntryScore Num_List" respectively. In the Operation phase, each match is added to the grid as a new grid cell, with the attributes "Tag=MatchTime", "Tag=MatchOrder", and "Tag=MatchScore" added respectively.
2.6 Use the grid-cell retrieval API GetUnits to obtain the grid cells carrying the attributes "Tag=MatchTime", "Tag=MatchOrder", and "Tag=MatchScore", and output them. The output is: "38 minutes of the lower half", "the 1st", and "1-1".
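Retrieval in the spirit of GetUnits reduces to filtering cells on a key=value test over their attributes; a minimal sketch with assumed names:

```python
# Sketch of attribute-based retrieval, in the spirit of GetUnits("Tag=MatchTime"):
# return every grid cell whose attributes satisfy the key=value test.
def get_units(cells, key, value):
    return [c for c in cells if c.get("attrs", {}).get(key) == value]

cells = [
    {"word": "38 minutes of the lower half", "attrs": {"Tag": "MatchTime"}},
    {"word": "the 1st", "attrs": {"Tag": "MatchOrder"}},
    {"word": "Li Ming", "attrs": {"POS": "NP-SBJ"}},
]
for tag in ("MatchTime", "MatchOrder", "MatchScore"):
    for cell in get_units(cells, "Tag", tag):
        print(cell["word"])
```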
S6: indexing the data table file and the finite state automata file by using an indexing function of the executable program;
1) Index the data table file Merge.tab. Each data table file corresponds to one set of index data, generated by gpf.exe. The index command is: gpf.exe -table Merge.tab ./idx/, where:
gpf.exe: the indexing tool;
-table: specifies the function type as indexing a data table;
Merge.tab: the data table file name, located in the same directory as gpf.exe;
./idx/: the path where the index data is stored.
After a data table file is indexed, it corresponds to two index files. Each index file name is the concatenation of the data table file name and "table", with suffixes .idx and .dat, so indexing produces the following two files:
./idx/Mergetable.idx
./idx/Mergetable.dat
2) Index the finite state automaton file Merge.fsa. Each finite state automaton file corresponds to one set of index data, generated by fsa.exe. The index command is:
fsa.exe -fsa Merge.fsa ./idx/
fsa.exe: the finite state automaton indexing tool;
-fsa: specifies the function type as indexing a finite state automaton script;
Merge.fsa: the finite state automaton script file name;
./idx/: the path where the index data is stored.
After a finite state automaton script file is indexed, it corresponds to two index files. Each index file name is the concatenation of the script file name and "fsa", with suffixes .idx and .dat, so indexing the previous example produces the following two files:
./idx/Mergefsa.idx
./idx/Mergefsa.dat
S7: write the index paths of the data table file and the finite state automaton file into a configuration file. The applied data table Merge.tab and finite state automaton Merge.fsa are written into the configuration file in JSON format, configured as follows:
Table:{"Path":"./idx/","Data":["Mergetable"]}
FSA:{"Path":"./idx/","Data":["Mergefsa"]}
Line 1 is the configuration of the data table, where "Table", "Path", and "Data" are reserved words: "Table" indicates that the line is a data table configuration item, while "Path" and "Data" give the path of the data table index files and the index file names respectively; when several file names appear under "Data", index files listed later are looked up with higher priority.
Line 2 is the configuration of the finite state automaton script, where "FSA", "Path", and "Data" are reserved words: "FSA" indicates that the line is a finite state automaton script configuration item, while "Path" and "Data" give the path of the script's index files and the index file names respectively; when several file names appear under "Data", index files listed later are looked up with higher priority.
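The stated lookup rule (when several file names appear under "Data", later entries are preferred) can be sketched as follows; the configuration dict mirrors the lines above, while "Overridetable" and the function name resolve are invented for illustration.

```python
# Sketch of the stated lookup rule: when several index file names appear
# under "Data", the one listed later takes precedence. Names are illustrative.
config = {
    "Table": {"Path": "./idx/", "Data": ["Mergetable", "Overridetable"]},
}

def resolve(config, kind):
    """Return index file stems in lookup order: later entries first."""
    entry = config[kind]
    return [entry["Path"] + name for name in reversed(entry["Data"])]

print(resolve(config, "Table"))
# ['./idx/Overridetable', './idx/Mergetable']
```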
S8: use the execution function of the executable program to run the language structure calculation script; the overall control script and the configuration file must be passed in.
The commands for locally running the master control script are as follows:
gpf.exe -lua Merge.lua config.txt
gpf.exe: the run tool;
-lua: specifies the function type as executing a Lua script;
Merge.lua: the Lua script file name;
config.txt: the locally run configuration file.
After the above index and execute commands are run at the command-line prompt, the window outputs: "38 minutes of the lower half", "the 1st", and "1-1", completing recognition of the target phrases.
The analysis framework for language structure calculation provided by the invention has the following characteristics:
1) Knowledge-data collaboration: in contrast to integrated modeling methods that fuse knowledge and data, the collaborative approach focuses on using knowledge and data separately to complete different tasks;
2) A knowledge-based symbolic-computation expert system is built as the master control center of the calculation, scheduling other models to jointly complete the overall task;
3) Complex tasks are decomposed into several subtasks, each dispatched to model computation, giving full play to the capability of deep learning models;
4) The framework supports language structure calculation with a disambiguation strategy based on multi-source features: the expert system acting as master control center collects multi-source features and feeds them to a decision model; the decision model can use parameterized methods such as machine learning to reach a decision from these features and returns the result to the master control center, which produces the output of the overall task.
The framework gives full play to knowledge, and the system's flow control matches the human cognitive process. In implementation, different strategies can be adopted according to problem complexity: a simple task can be completed by the expert system alone, which gives the final result; for a complex task, the expert system can serve as a feature generation component, with the decision made from the features by parameterized computation. For example, deep semantic analysis uses knowledge as the overall control, calls a word sense disambiguation model and a decision model, and coordinates these models through structural analysis, word sense disambiguation, relation disambiguation, and so on to solve the problem.
The framework can be used for research such as lexical analysis, syntactic analysis, joint lexical-syntactic analysis, and semantic analysis, and also for building practical natural language understanding applications. Its application in language research covers the following aspects:
1) Lexical analysis: unlike Indo-European languages, Chinese words have no formal boundaries and lack morphological markers, while Chinese word formation follows regular patterns of character combination. The framework can be applied to dynamically identify words and the internal structure types of words, for example recognition of reduplicated words, separable words, and affixed words.
2) Syntactic analysis: common phrases and their internal structures are identified from the formal features of phrase combination, for example temporal phrase recognition and proper noun recognition. With this framework, structural knowledge is introduced in cooperation with parameterized models, improving syntactic analysis accuracy.
3) Joint lexical-syntactic analysis: the internal structure of a Chinese word is related to the word's external grammatical function; the internal structure and external context of verbs, for instance, can be analyzed as a whole.
4) Semantic analysis: a combined approach is adopted in which features supporting semantic analysis are computed from framework knowledge with the help of intermediate structures and fed to a decision model to complete semantic disambiguation.
The framework's application in building practical natural language understanding systems covers the following aspects:
1) The framework can serve as a deep semantic analysis tool, customized per domain to solve fine-grained application problems and deliver tasks such as semantic analysis that meets scenario requirements.
2) The framework can also be used for knowledge acquisition, such as collecting various kinds of lexical collocation data, or outputting the entity and event structures of sentences in order to build domain knowledge graphs.
3) The framework can be used for data pre-annotation: on the basis of the raw input data, the data are pre-annotated according to the attribute labels or relation labels of the words injected into the analysis structure.
Claims (14)
1. An analysis framework supporting natural language structure computation, comprising:
grid: a data structure is used for storing language structures and is used as a computing platform for bearing various language structures;
data table: a packaging component used to package unary and binary knowledge of the symbolic type;
finite state automaton: a computation control component used to represent language context, which cooperates with the scripting language to complete control tasks with code-like functional structure;
an API system: used, together with the Lua scripting language, to write language structure calculation scripts around the grid, data tables, and finite state automata;
an executable program for language structure calculation scripts: it provides an indexing function that indexes data tables and finite state automata; it provides an execution function that runs language structure calculation scripts to perform language structure calculation; and, for executing the master control script, the index file paths of the data tables and finite state automata and the IP addresses and interfaces of external services are written into a configuration file that is passed to the executable program.
2. The analytic framework of claim 1, wherein: the grid includes the following internal variables: the text to be analyzed, the grid attributes, the grid cells, and the relations between grid cells;
the grid cell includes the following internal variables: the stored language units, the attributes of the grid units, and the features and scores of the attributes of the grid units;
the relationships between the grid cells include the following internal variables: attributes of relationships between grid cells, characteristics and scores of relationships between grid cells.
3. The analytic framework supporting natural language structure computation of claim 2, wherein: the language structure is expressed as a directed graph with attribute information, and comprises language units, relations and attributes, and the loading mode of the language structure loaded into the grid is as follows: the language units are borne by the grid units, and each language unit corresponds to one grid unit; relationships between language units are carried by relationships between grid units; the attributes of the language units are carried by the attributes of the grid units; the attributes of the relationships between language units are carried by the attributes of the relationships between grid units.
4. The analytic framework of claim 1, wherein: the language structure is denoted G = (U, R), wherein:
1) U is the node set of the graph, a finite set of language units, written: U = {tu, au}, where tu ∈ Lex, Lex is the set of language unit strings, and au is the attribute of a language unit, a set of key-value pairs written: au = {K=V}, with K an attribute name and V an attribute value;
2) R is a finite set of edges connecting two different nodes in U, i.e. the set of relations between two language units; since relations between language units are generally not symmetric, each edge is directed and is written: R = {u_i, u_j, r, ar}, where u_i ∈ U, u_j ∈ U, r ∈ RT, and RT is the set of relation types; ar is the attribute of the language unit relation, a set of key-value pairs written: ar = {K=V}, with K an attribute name and V an attribute value.
5. The analytic framework supporting natural language structure computation of claim 1, wherein: the API comprises the following steps: (1) External services are called, and a service result is led into an API (application program interface) of the grid, so that the scheduling of the parameter calculation model is realized; (2) The API for controlling the internal variables of the grid structure comprises the addition, the test and the acquisition of the internal variables, so that the controllability of the language structure is realized; (3) The API of the data table is applied, namely, the interaction between the data table and the grid is realized, so that display knowledge is introduced; (4) The API of the finite state automata is applied, namely, the interaction between the finite state automata and the grid is realized, and therefore the functions of the framework are expanded.
6. The analytic framework of claim 1, wherein: the data table is defined as a set of triples, namely: tableName = { < Item, attribute, condition > }, wherein:
TableName: a table name;
Item: the data item, either a character string Word or a key-value expression KV; when it is Word, it corresponds to a language unit, which during grid computation corresponds to a grid cell; when it is KV, the data item identifies all grid cells in the corresponding grid for which KV evaluates to true;
Attribute: the attributes of the Item, a set of key-value pairs {K=V}; {K=V} is carried into the attributes of the grid cell or into the relation attributes between cells;
Condition: the applicability condition of the Item, a set of key-value expressions {KV}; when the grid cell corresponding to the current Item satisfies one of the key-value expressions in {KV}, the next operation proceeds, which includes adding grid cells, adding attributes, or establishing relations.
7. The analytic framework of claim 1, wherein: the data tables are divided into two types, one type is a description type data table, and the other type is a relational type data table;
the described objects of the described data table are independent language units, the form of the language units and the attribute knowledge of the language units are given, and the knowledge is used for determining the language units and setting the attribute of the language units;
The described object of a relational data table involves two language units: one is a central language unit, the other is a language unit having a certain relation with it. Binary relations are packaged by several data tables: the relational data table is designed as one main table plus several auxiliary tables, the main table storing the list of central language units and the auxiliary tables storing the lists of language units that form a given relation with a central language unit.
8. The analytic framework of claim 1, wherein: the finite state automaton is defined as a set of four-tuples, namely: FSAName = {&lt;Enter, Path, Operation, Exit&gt;}, wherein: FSAName: the name of the finite state automaton, a globally unique name; Enter: the entry node, each finite state automaton having a unique entry; Path: the path corresponding to the finite state automaton, representing the context information during structure calculation; Operation: the operation node, i.e. the actions to be executed when the context corresponding to the Path is matched successfully; Exit: the exit node, each finite state automaton having a unique exit.
9. The analytic framework of claim 1, wherein: one script can correspond to one or more finite state automata, and the script of each finite state automata comprises an FSA name, a parameter item, a control item and a function library shared by a plurality of finite state automata;
the FSA name: different finite state automata are distinguished through FSA names;
the parameter items are: configuring relevant conditions of current FSA operation;
the control items: describe the different Context conditions and the corresponding operations; each consists of a Context part and an Operation part, where Context describes the condition part of an FSA path and is composed of several items, and Operation describes the operation performed under the corresponding condition;
the function library: shared by several finite state automata; a function library name is defined in the form "NameSpace Name", where NameSpace is a reserved word, and the function bodies are defined as Lua script.
10. The analytic framework supporting natural language structure computation of claim 5, wherein: the calling of the external service and the leading of the service result into the API of the grid realize the application of the parameter model, and the method comprises the following steps:
(1) CallService(Sentence, ServiceName): the API that calls a service; the called service is a parameterized model performing structural analysis, which structures the text to be analyzed and returns the data of the initial language structure; the first parameter is the input passed to the service, the second is the service name, which must be configured in the configuration file, and the return value is the language structure returned by the service;
(2) AddStructure(Sentence_JSON): injects a language structure in JSON format into the grid.
11. The analytic framework of claim 5, wherein: the API for controlling the internal variables of the grid structure realizes the addition, acquisition and test of the internal variables of the grid structure, improves the controllability and interpretability of language structure calculation, and comprises the following steps: (1) an addition class API of internal variables; (2) an acquisition type API of the internal variable; and (3) testing API of internal variables.
12. The analytic framework of claim 5, wherein: the APIs of the application data table are divided into three types: function type, acquisition type and test type, wherein the function type API realizes the interaction between the grid and the data table, so that the framework has the capability of applying explicit knowledge, and the method comprises the following steps:
1)Segment(TableName)
the Segment has the function of segmenting the text in the grid based on the data table and adding attributes; tableName represents a table name;
2)SetLexicon(TableName)
the SetLexicon function is to add the attribute of the data item in the data table to the grid unit and provide the application attribute information for the grid unit;
3)Relate(TableName)
The function of Relate is to import a relational data table by invoking the main table name, the relation class name, and the relation name;
4)Str=GetPrefix(TableName,String)
GetPrefix judges whether the character string takes a certain data item in the data table as a prefix string, if so, the longest matching string is returned;
5)Str=GetSuffix(TableName,String)
GetSuffix judges whether the character string takes a certain data item in the data table as a suffix string, and if yes, the longest matching string is returned.
13. The analytic framework of claim 12, wherein: when a relationship type data table is imported into a grid by using a relationship function, a data item in the data table corresponds to a grid cell, a relationship formed by the data item in a main table and a data item in a secondary table corresponds to a cell relationship in the grid, and the function realized by the relationship function is decomposed into the following steps:
(1) importing the main table data items meeting the Limit condition and the attributes thereof into the grid; adding an 'ST-Unit' attribute to the grid, wherein the value of the attribute is TableName; the grid unit with the source of the data table TableName exists in the grid;
(2) acquiring all the slave tables corresponding to the data items of the master table through the Coll attribute;
(3) adding a slave table data item satisfying the Limit condition to the grid;
(4) adding the grid cell relation, the U-type attribute and the R-type attribute into a grid cell;
(5) adding the attributes of the slave table data items to the attributes of the grid cell relationships;
(6) adding the attributes "URoot", "URootTableName", "RRoot", "RRootTableName", and "ST-Relation" to the grid, where for each relation &lt;HeadUnit, SubUnit, Relation&gt; between grid cells successfully added to the grid, the attribute values are taken from HeadUnit, Relation, and TableName respectively.
14. The analytic framework supporting natural language structure computation of claim 5, wherein: the API of the finite state automaton is applied to realize interaction between the grid and the finite state automaton, giving the framework the capability to efficiently recognize and process context; this class of APIs includes RunFSA, GetFSANode, and GetParam:
(1)RunFSA(FSAName(,Param))
the API has the functions of executing a finite state automaton, namely completing the matching of Context and grids in an FSA script and executing corresponding Operation, wherein FSAName is the name of the finite state automaton to be executed, param is a parameter to be transmitted, and the parameter can be called in the finite state automaton script through an API GetParam;
(2) No = GetFSANode(-1) or No1, No2 = GetFSANode("$Tag")
Executing a finite state automaton means matching a Context in the finite state automaton script against the grid; a successful match establishes one or more paths from the Enter node through to the Exit node. In the Operation script, the grid cells corresponding to the nodes in the Context can be accessed: the attribute nodes on the FSA path are numbered in sequence, and grid cells are then accessed by referencing these node numbers.
(3)Str=GetParam(Key)
Reading the parameter value transmitted by the RunFSA function in the FSA.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211333124.2A CN115935943A (en) | 2022-10-28 | 2022-10-28 | Analysis framework supporting natural language structure calculation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115935943A true CN115935943A (en) | 2023-04-07 |
Family
ID=86696548
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115935943A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117608764A (en) * | 2024-01-18 | 2024-02-27 | 成都索贝数码科技股份有限公司 | Container platform operation and maintenance method and system |
CN117608764B (en) * | 2024-01-18 | 2024-04-26 | 成都索贝数码科技股份有限公司 | Container platform operation and maintenance method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||