CN108170679B

CN108170679B - Semantic matching method and system based on computer recognizable natural language description

Info

Publication number: CN108170679B
Application number: CN201711460123.3A
Authority: CN
Inventors: 杨学红
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2017-12-28
Filing date: 2017-12-28
Publication date: 2021-09-03
Anticipated expiration: 2037-12-28
Also published as: CN108170679A

Abstract

The invention belongs to the technical field of programming, and in particular relates to a semantic matching method based on computer recognizable natural language description and a corresponding semantic matching system. The method comprises the following steps: step S1): taking the logic and steps defined by the grammatical rules of the target language as a reference, constraining the natural language requirement description into a structure with logical steps; step S2): for the fixed sentence patterns in the constrained natural language requirement description, obtaining a candidate word set comprising the roots in the natural language requirement description; step S3): segmenting the message name/operation name in the target language to obtain a standby word set comprising the roots in the message name/operation name; step S4): calculating the matching degree of the candidate word set and the standby word set. The method and the system can reconcile the differences between users and developers in their use of natural language and enable automatic programming in a machine language.

Description

Semantic matching method and system based on computer recognizable natural language description

Technical Field

The invention belongs to the technical field of programming, and particularly relates to a semantic matching method based on computer recognizable natural language description and a corresponding semantic matching system based on computer recognizable natural language description.

Background

Natural language is still the language in which software requirements documents are written. Automatically generating a process flow from functional requirements described in natural language not only helps users and developers reach consensus on the requirements quickly, but also speeds up development of the process flow.

However, because users and developers have different concerns, they often describe requirements differently. When describing functional requirements in natural language, users care about the functions the software provides, the performance level it achieves and so on, while developers describe the requirements from a technical point of view. Moreover, they do not know the naming conventions of the specific messages and operations used in the development language, so the content words they use to describe the requirements are not necessarily the words used in the message names and operation names of the development language. In addition, in most cases users are not familiar with specialized terms and technical issues.

However, most current software requirements documents are still written in natural language, for two reasons: first, most users and developers are not able to describe requirements formally; second, natural language has a rich vocabulary and strong expressive power. Nevertheless, natural language also has unavoidable drawbacks, including ambiguity and inconsistency.

To make up for these deficiencies of natural language, a method is needed that can constrain and formalize process requirement descriptions expressed in natural language so that a computer can understand them. How to reconcile the differences between users and developers in their use of natural language has become a technical problem that urgently needs to be solved.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a semantic matching method based on computer recognizable natural language description and a corresponding semantic matching system based on computer recognizable natural language description, which can effectively eliminate the divergence of users and developers in natural language application and realize the automatic programming of machine language.

The technical scheme adopted for solving the technical problem of the invention is that the semantic matching method based on the computer recognizable natural language description comprises the following steps:

step S1): taking the logic and steps defined by the grammatical rules of the target language as a reference, constraining the natural language requirement description into a structure with logical steps;

step S2): for the fixed sentence patterns in the constrained natural language requirement description, obtaining a candidate word set comprising the roots in the natural language requirement description;

step S3): segmenting the message name/operation name in the target language to obtain a standby word set comprising the roots in the message name/operation name;

step S4): calculating the matching degree of the candidate word set and the standby word set.

Preferably, the step S2) includes:

step S21): acquiring a requirement sentence described in natural language according to a set qualifier, and segmenting the requirement sentence into words to form a primary word set;

step S22): removing stop words from the primary word set to form an applicable word set;

step S23): carrying out synonym expansion on each word in the applicable word set to form an expanded word set;

step S24): carrying out root reduction on the expanded word set to obtain a candidate word set comprising the roots in the natural language requirement description.

Preferably, in step S21), the qualifier set for converting the requirement sentence into the target language uses a prefix as its identifier;

in step S22), auxiliary words, prepositions and conjunctions are stored in advance as stop words to form a stop word bank;

in step S23), synonym expansion is carried out on each word in the applicable word set according to a synonym thesaurus;

in step S24), the root reduction algorithm is the Porter algorithm or a Lucene algorithm.

Preferably, the step S4) includes the steps of:

step S41): traversing the words of the standby word set, and screening the words which have intersection with the candidate word set;

step S42): and calculating the matching degree of the words meeting the intersection.

Preferably, in step S4), the matching degree between the candidate word set and the standby word set is represented by the following formula:

wherein count is the number of words found to have similar semantics, |wordset_A| is the number of segmented words in the requirement description sentence, and |wordset_B| is the number of segmented words in the message name/operation name.
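Assuming that the matching degree keeps the shape of the classical Dice coefficient, with the plain set intersection replaced by the synonym-aware count defined above, it could be written as follows. This is a reconstruction from the variable definitions in this section and from the later statement that the Dice algorithm is improved into DicePlus; it is not a formula quoted from the specification.

```latex
\mathrm{DicePlus}(A, B) = \frac{2 \times \mathrm{count}}{\lvert \mathrm{wordset}_A \rvert + \lvert \mathrm{wordset}_B \rvert}
```

Under this assumption the value is 0 when no semantically similar words are found and grows with the number of matched words.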

A semantic matching system based on computer recognizable natural language description comprises a constraint module, a candidate word set forming module, a standby word set forming module and a matching module, wherein:

the constraint module is used for describing and constraining natural language requirements into a structure with logical steps by taking the logic and the steps defined by the grammatical rules of the target language as reference;

the candidate word set composition module is used for obtaining a candidate word set comprising a root word in the natural language requirement description for a sentence pattern fixed in the constrained natural language requirement description;

the standby word set forming module is used for segmenting the message name/the operation name in the target language to obtain a standby word set comprising the root word in the message name/the operation name;

and the matching module is used for calculating the matching degree of the candidate word set and the standby word set.

Preferably, the candidate word set composition module includes a primary word set unit, an applicable word set unit, a synonym expansion unit, and a root reduction unit, wherein:

the primary word set unit is used for acquiring a requirement sentence described in natural language according to a set qualifier, and segmenting the requirement sentence into words to form a primary word set;

the applicable word set unit is used for removing stop words from the primary word set to form an applicable word set;

the synonym expansion unit is used for carrying out synonym expansion on each word in the applicable word set to form an expanded word set;

and the root reduction unit is used for carrying out root reduction on the expanded word set to obtain a candidate word set comprising the roots in the natural language requirement description.

Preferably, in the primary word set unit, the qualifier set for converting the requirement sentence into the target language uses a prefix as its identifier;

in the applicable word set unit, auxiliary words, prepositions and conjunctions are stored in advance as stop words to form a stop word bank;

in the synonym expansion unit, synonym expansion is carried out on each word in the applicable word set according to a synonym thesaurus;

in the root reduction unit, the root reduction algorithm is the Porter algorithm or a Lucene algorithm.

Preferably, the matching module includes an intersection unit and a matching unit, wherein:

the intersection unit is used for traversing the words of the standby word set and screening the words which have intersection with the candidate word set;

and the matching unit is used for calculating the matching degree of the words meeting the intersection.

Preferably, in the matching unit, the formula of the matching degree between the candidate word set and the standby word set is as follows:

The beneficial effects of the invention are as follows: on the basis of word segmentation, stop word removal, root reduction and similarity calculation, the semantic matching method based on computer recognizable natural language description and the corresponding semantic matching system add synonym expansion and modify the similarity calculation, so that they are suited to matching requirement descriptions against message names/operation names. They can reconcile the differences between users and developers in their use of natural language and enable automatic programming in a machine language.

Drawings

FIG. 1 is a flow chart of a semantic matching method based on computer recognizable natural language description according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating the steps of obtaining a set of candidate words including a root word in a requirement description according to an embodiment of the present invention;

FIG. 3 is a block diagram of a semantic matching system based on computer recognizable natural language description according to an embodiment of the present invention;

in the figure:

1-a constraint module; 2-candidate word set composition module; 3-a standby word set forming module; 4-matching module.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the following describes the computer recognizable natural language description based semantic matching method and the corresponding computer recognizable natural language description based semantic matching system in further detail with reference to the accompanying drawings and the detailed description.

To build a bridge between requirement descriptions and the development language, the invention provides a semantic matching method based on computer recognizable natural language description. From the perspective of semantic matching, the method relies on a hierarchical word stock (which can be understood as an English dictionary) built from roots and synonyms. It can reconcile the differences between users and developers in their use of natural language, enable automatic programming in a machine language, and greatly speed up project progress.

As shown in fig. 1, the semantic matching method based on computer recognizable natural language description in the present invention includes the following steps:

step S1): with reference to the logic and steps defined by the grammatical rules of the target language, the natural language requirement description is constrained into a structure with logical steps.

Process function requirements described in natural language have a certain stepwise character, which is expressed by connective words in the sentences, such as after, if, then, or else, at the same time and the like. However, logical step relationships that are simple for humans are not easily recognized and understood by computers. A constraint rule therefore needs to be specified in preparation for the conversion from natural language to the target language, so that users and developers write requirement descriptions according to the rule and the steps in those descriptions are directly apparent to a computer. The target language may be the computer language selected for programming.

In this step, the natural language requirement description is constrained into a structure with logical steps, and the words forming that structure can be collected into a word set wordset_A.
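As a rough illustration of what such a constraint rule makes possible, the sketch below splits a constrained description into (keyword, clause) pairs. The keyword list is taken from the connectives named above; the function name `split_steps` and the splitting strategy are assumptions made for this example, not part of the described method.

```python
from __future__ import annotations

import re

# Step keywords taken from the connectives named in the text (illustrative list).
STEP_KEYWORDS = ["at the same time", "or else", "after", "if", "then"]


def split_steps(description: str) -> list[tuple[str, str]]:
    """Split a constrained description into (keyword, clause) pairs."""
    # Try longer keywords first so that multi-word keywords are matched whole.
    ordered = sorted(STEP_KEYWORDS, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in ordered) + r")\b",
        re.IGNORECASE,
    )
    # With one capturing group, split() returns
    # [text_before, keyword_1, clause_1, keyword_2, clause_2, ...].
    parts = pattern.split(description)
    return [
        (keyword.lower(), clause.strip(" ,."))
        for keyword, clause in zip(parts[1::2], parts[2::2])
    ]


steps = split_steps(
    "If the order is valid, then invoke the payment service, "
    "or else reject the order."
)
```

Each pair exposes one logical step to later processing; text that precedes the first keyword is ignored in this sketch.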

Step S2): for the fixed sentence patterns in the constrained natural language requirement description, acquiring a candidate word set comprising the roots in the natural language requirement description.

When describing functional requirements in natural language, users and developers do not know the names of the specific messages and operations in the program file, so the content words they use are not necessarily the words used in the message names and operation names in the program file. In this step, the computer automatically formalizes the fixed sentence patterns in the constrained requirement description, with reference to the logic and steps defined by the grammatical rules of the target language. Formalization is typically done in a target development language (for example, the business process execution language BPEL), i.e. the requirement description is converted into a language that a computer can understand. The invention targets a process composition language: the requirement description, which has a certain stepwise character after constraint processing, is formally converted into the corresponding statements of the process composition language. Formalization thus bridges the requirement description of the process and the target language.

Therefore, in this step, from the perspective of semantic matching, a synonym thesaurus is used and a matching algorithm based on roots and synonyms is applied to the natural language requirement description, so that a candidate word set comprising the roots in the natural language requirement description is obtained for the fixed sentence patterns in the constrained natural language requirement description.

Referring to fig. 2, the process of obtaining the final word set wordset_A from a requirement sentence A described in natural language is described in detail below. It specifically comprises the following steps:

step S21): acquiring a requirement sentence described in natural language according to the set qualifier, and segmenting the requirement sentence into words to form a primary word set.

Here, after the requirement sentence A described in natural language is constrained and formalized, a constraint statement A' is obtained. If the constraint statement A' contains a set qualifier, the natural language requirement description sentence of A' is extracted and segmented to obtain the primary word set wordset'_A. In general, the qualifiers are determined by the target language into which the requirement description is to be converted, so they can be set in advance and retrieved from the word stock of the target language.

Preferably, the qualifier set for converting the requirement sentence A into the target language can be identified by a prefix. Taking the business process execution language BPEL as an example, if the prefix of the constraint statement A' is [RECEIVE] or [INVOKE], the natural language requirement description sentence is extracted from A' and segmented to obtain the primary word set wordset'_A. Here, [RECEIVE] indicates that a message is received, and [INVOKE] indicates that a service is invoked.
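A minimal sketch of step S21) follows, assuming that the qualifier appears as a bracketed prefix exactly as in the BPEL example and that word segmentation can be approximated by extracting lowercase alphabetic tokens. The function name `primary_word_set` and the regular expressions are illustrative assumptions.

```python
from __future__ import annotations

import re

# Qualifier prefixes named in the text for BPEL; any further entries would be assumptions.
QUALIFIERS = {"RECEIVE", "INVOKE"}


def primary_word_set(constraint_statement: str) -> tuple[str, list[str]] | None:
    """Return (qualifier, words) if the statement starts with a known [QUALIFIER] prefix."""
    match = re.match(r"\s*\[\s*(\w+)\s*\]\s*(.*)", constraint_statement, re.DOTALL)
    if match is None or match.group(1).upper() not in QUALIFIERS:
        return None  # no set qualifier, so nothing is extracted
    qualifier, sentence = match.group(1).upper(), match.group(2)
    # Naive word segmentation: lowercase alphabetic tokens only.
    words = re.findall(r"[a-z]+", sentence.lower())
    return qualifier, words


result = primary_word_set("[RECEIVE] get the order request from the customer")
```

Statements without a recognized prefix return `None`, mirroring the condition that only constraint statements containing a set qualifier are segmented.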

Step S22): removing stop words from the primary word set to form an applicable word set.

In general, besides content words such as nouns, adjectives and verbs, a requirement sentence may also contain function words with no actual meaning, such as auxiliary words, prepositions and conjunctions. Since the purpose is to find, among all target documents, the message and operation that best match the requirement sentence semantically, these semantically irrelevant words would interfere with semantic matching and need to be eliminated when the matching degree is calculated. Therefore, further preferably, to keep the word set clean, auxiliary words, prepositions, conjunctions and the like are stored in advance as stop words in a stop word bank D. The stop words are removed from the primary word set wordset'_A according to the stop word bank D to obtain the applicable word set.

For any word w in wordset'_A, if w ∈ D, then w is removed from wordset'_A.
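The filtering rule can be sketched in a few lines. The stop word bank below is a toy stand-in for D, and the name `remove_stop_words` is an assumption for this example.

```python
# Toy stop word bank D; a real bank would hold the auxiliary words,
# prepositions and conjunctions of the natural language in use.
STOP_WORDS = frozenset({"the", "a", "an", "of", "from", "to", "and", "or", "is"})


def remove_stop_words(primary_words, stop_words=STOP_WORDS):
    """Keep the words of the primary word set that are not in D, preserving order."""
    return [w for w in primary_words if w.lower() not in stop_words]


applicable = remove_stop_words(
    ["get", "the", "order", "request", "from", "the", "customer"]
)
```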

Step S23): carrying out synonym expansion on each word in the applicable word set.

In this step, synonym expansion is carried out on each word in the applicable word set according to the synonym thesaurus C (which can be understood as a general English dictionary). For any word w in the applicable word set, the synonym set synonyms(w) of w is looked up in the synonym thesaurus C, and all synonyms of w are added to the applicable word set, giving the expanded word set wordset''_A.
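The expansion step might look as follows, assuming the synonym thesaurus C can be represented as a mapping from a word to its synonym set. The entries in `THESAURUS` are made up for illustration, and `expand_synonyms` is a hypothetical name.

```python
# Toy synonym thesaurus C: word -> set of synonyms (illustrative entries only).
THESAURUS = {
    "get": {"receive", "obtain"},
    "receive": {"get", "obtain"},
    "order": {"purchase"},
}


def expand_synonyms(applicable_words, thesaurus=THESAURUS):
    """Return the applicable words together with all their synonyms."""
    expanded = set(applicable_words)
    for w in applicable_words:
        # Words without an entry in the thesaurus contribute nothing extra.
        expanded |= thesaurus.get(w, set())
    return expanded


expanded = expand_synonyms(["get", "order", "request"])
```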

Step S24): carrying out root reduction on the expanded word set.

In the root reduction step, for any word w in the expanded word set wordset''_A, the root w' of w is calculated with a root reduction algorithm and w is replaced by w' in wordset''_A, giving the candidate word set wordset_A comprising the roots in the natural language requirement description, i.e. wordset_A = wordset''_A - w + w'. Here, w' is denoted porter(w). The specific root reduction algorithm may be the Porter algorithm or a Lucene algorithm, which is not limited herein.
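The replacement of each word by its root can be sketched as below. To keep the example self-contained, `toy_stem` is a deliberately crude suffix stripper standing in for porter(w); it is not the Porter algorithm, and a real implementation would plug in an actual stemmer through the `stem` parameter.

```python
def toy_stem(word):
    """Very rough suffix stripping; a stand-in for porter(w), NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def reduce_to_roots(expanded_words, stem=toy_stem):
    """Replace every word w of the expanded word set by its root w'."""
    return {stem(w) for w in expanded_words}


candidate = reduce_to_roots({"orders", "received", "get"})
```

Because the result is a set, words that share a root collapse into a single entry.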

Through the above steps, the requirement sentence described in natural language is processed in turn by word segmentation, stop word removal, synonym expansion and root reduction. The roots of the words in the sentence, together with the expansions at the same level as those roots, are obtained without interference from stop words. This accommodates, to the greatest extent, the semantic variation between users and developers in their communication and provides a richer basis of candidate matches for conversion into the computer language.

Step S3): segmenting the message name/operation name in the target language to obtain a standby word set comprising the roots in the message name/operation name.

In this step, the message name/operation name is segmented into words, and the resulting word set corresponds to the fixed sentence pattern in the constrained requirement description. After word segmentation, the standby word set of a message name/operation name B is wordset_B.

It should be understood that, because each computer language has its own particularities, how a message name/operation name B is defined must be analyzed for the specific language.
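As one possible instance of such language-specific analysis, the sketch below assumes that message and operation names are identifiers written in camelCase, PascalCase or with underscores, and splits them at those boundaries. The naming convention and the function name `segment_name` are assumptions; other target languages may need different rules.

```python
import re


def segment_name(name):
    """Split an identifier such as 'receiveOrderRequest' or 'get_order' into lowercase words."""
    words = []
    # First cut at any non-letter character (underscores, digits, dots, ...).
    for chunk in re.split(r"[^A-Za-z]+", name):
        # Then cut each chunk at case boundaries; runs of capitals stay together.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk)
    return [w.lower() for w in words]


standby = segment_name("receiveOrderRequest")
```

The resulting words would then go through the same root reduction as the candidate words so that both sets are compared at the level of roots.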

The invention applies to automatically converting process function requirements described in natural language into a description in the development language, and adds a matching algorithm based on roots and synonyms to the semantic processing to improve accuracy. The matching degree is therefore calculated between the candidate word set obtained in step S2) and the standby word set obtained in step S3), so that the words in the candidate word set are matched to the standby word set with the greatest similarity.

Existing matching degree calculation methods include the Dice and Euclidean similarity calculations. In this embodiment, in order to find the process corresponding to the natural language more accurately, the Dice similarity algorithm is improved to take roots and synonyms into account, and the improved algorithm, DicePlus, is used to calculate the matching degree of wordset_A and wordset_B.

The improved extended similarity calculation algorithm DicePlus comprises the steps of:

step S41): traversing the words of the standby word set, and screening out the words that intersect with the candidate word set.

In this step, each word in the standby word set wordset_B is traversed. For a word w in wordset_B, it is checked whether w exists in wordset_A or whether a synonym of w exists in wordset_A, so as to determine whether the words in the standby word set wordset_B intersect with the words in the candidate word set wordset_A.
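The screening in step S41) can be sketched as a single pass over wordset_B. The thesaurus is again assumed to be a mapping from a word to its synonym set, and `screen_intersection` is a hypothetical name.

```python
def screen_intersection(standby_words, candidate_words, thesaurus):
    """Return the words of wordset_B found in wordset_A directly or through a synonym."""
    candidates = set(candidate_words)
    matched = []
    for w in standby_words:
        # Direct hit, or at least one synonym of w is a candidate word.
        if w in candidates or thesaurus.get(w, set()) & candidates:
            matched.append(w)
    return matched


matched = screen_intersection(
    ["get", "order", "status"],
    {"receive", "order", "request"},
    {"get": {"receive"}},
)
```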

The matching degree is calculated in order to find a program statement whose matching degree is sufficient to replace the corresponding requirement description sentence; if no such program statement can be found, the developers need to write the corresponding statement themselves. In step S42), the matching degree of the candidate word set wordset_A and the standby word set wordset_B is calculated with the DicePlus formula.
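The following sketch computes a matching degree under the assumption that DicePlus has the form 2 × count / (|wordset_A| + |wordset_B|), i.e. the classical Dice coefficient with the intersection size replaced by the synonym-aware count from step S41). That form is inferred from the variable definitions given earlier and is not quoted from the specification; the function name `dice_plus` is likewise hypothetical, and the set sizes used are simply those of the sets passed in.

```python
def dice_plus(candidate_words, standby_words, thesaurus):
    """Dice-style matching degree with synonym-aware counting (assumed form)."""
    a, b = set(candidate_words), set(standby_words)
    if not a and not b:
        return 0.0  # avoid division by zero for two empty sets
    # count: words of wordset_B that occur in wordset_A directly or via a synonym.
    count = sum(1 for w in b if w in a or thesaurus.get(w, set()) & a)
    return 2 * count / (len(a) + len(b))


degree = dice_plus(
    {"receive", "order", "request"},
    ["get", "order"],
    {"get": {"receive"}},
)
```

Here both words of the operation name are matched (one of them through a synonym), so count is 2 and the set sizes are 3 and 2.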

Based on the matching degree algorithm and the similarity algorithm, requirements described in natural language can be converted into a description language that a computer can recognize, so that statements described in natural language can be programmed automatically by computer. Even though the content words used in the user's requirement description are not necessarily identical to the words used by the developer (for example, receive and get both indicate receiving a message), accurate matching is still possible.

Natural language and computer languages are continuously developing and being updated, so the semantic matching method of the invention cannot be exhaustive. In later use, the word stock can be extended through self-learning and accumulated gradually, so that matching is continuously enriched and improved.

On the basis of word segmentation, stop word removal, root reduction and similarity calculation, the semantic matching method based on computer recognizable natural language description adds synonym expansion and modifies the similarity calculation, so that it is suited to matching requirement descriptions against message names/operation names. It can reconcile the differences between users and developers in their use of natural language and enable automatic programming in a machine language.

Correspondingly, this embodiment also provides a semantic matching system based on computer recognizable natural language description, which can reconcile the differences between users and developers in their use of natural language and enable automatic programming in a machine language.

As shown in fig. 3, the semantic matching system based on computer recognizable natural language description comprises a constraint module 1, a candidate word set composition module 2, a standby word set composition module 3 and a matching module 4, wherein:

a constraint module 1, which is used for describing and constraining natural language requirements into a structure with logical steps by taking the logic and steps defined by the grammatical rules of the target language as reference;

a candidate word set composition module 2, configured to obtain a candidate word set including a root in the natural language requirement description for a sentence pattern fixed in the constrained natural language requirement description;

a standby word set composition module 3, configured to perform word segmentation on the message name/operation name in the target language, and obtain a standby word set including a root word in the message name/operation name;

and the matching module 4 is used for calculating the matching degree of the candidate word set and the standby word set.
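One way to picture this division of labor is as four pluggable functions behind a small coordinating class. The class `SemanticMatchingSystem`, its `best_match` method and the toy lambdas below are all hypothetical; they only illustrate how the four modules could be wired together, not how the system is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class SemanticMatchingSystem:
    constrain: Callable[[str], str]               # constraint module 1
    build_candidates: Callable[[str], set]        # candidate word set composition module 2
    build_standby: Callable[[str], set]           # standby word set composition module 3
    matching_degree: Callable[[set, set], float]  # matching module 4

    def best_match(self, requirement: str, names: Iterable[str]):
        """Return (degree, name) for the best-matching name, or None if names is empty."""
        candidates = self.build_candidates(self.constrain(requirement))
        scored = [
            (self.matching_degree(candidates, self.build_standby(n)), n)
            for n in names
        ]
        return max(scored, default=None)


# Toy stand-ins for the four modules, for demonstration only.
system = SemanticMatchingSystem(
    constrain=lambda s: s.strip(),
    build_candidates=lambda s: set(s.lower().split()),
    build_standby=lambda n: set(n.lower().split("_")),
    matching_degree=lambda a, b: 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0,
)
best = system.best_match("receive order request", ["receive_order", "send_invoice"])
```

Keeping each module behind a plain callable makes it straightforward to swap in a different stemmer, thesaurus or matching formula without touching the coordinating code.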

The candidate word set composition module 2 includes a primary word set unit, an applicable word set unit, a synonym expansion unit, and a root reduction unit, wherein:

the primary word set unit is used for acquiring a requirement sentence described in natural language according to a set qualifier, and segmenting the requirement sentence into words to form a primary word set. In the primary word set unit, the qualifier set for converting the requirement sentence into the target language uses a prefix as its identifier;

the applicable word set unit is used for removing stop words from the primary word set to form an applicable word set. In the applicable word set unit, auxiliary words, prepositions and conjunctions are stored in advance as stop words to form a stop word bank;

the synonym expansion unit is used for carrying out synonym expansion on each word in the applicable word set. In the synonym expansion unit, synonym expansion is carried out on each word in the applicable word set according to a synonym thesaurus;

the root reduction unit is used for carrying out root reduction on the expanded word set to obtain a candidate word set comprising the roots in the natural language requirement description. In the root reduction unit, the root reduction algorithm is the Porter algorithm or a Lucene algorithm.

The matching module 4 comprises an intersection unit and a matching unit, wherein:

In the matching unit, the formula of the matching degree of the candidate word set and the spare word set is as follows:

On the basis of word segmentation, stop word removal, root reduction and similarity calculation, the semantic matching system based on computer recognizable natural language description adds synonym expansion and modifies the similarity calculation, so that it is suited to matching requirement descriptions against message names/operation names. It can reconcile the differences between users and developers in their use of natural language and enable automatic programming in a machine language.

It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims

1. A semantic matching method based on computer recognizable natural language description is characterized by comprising the following steps:

step S4): calculating the matching degree of the candidate word set and the standby word set so as to replace corresponding sentences in the natural language requirement description with target languages meeting the matching degree, wherein the target languages are computer languages selected for programming;

step S4) includes the steps of:

2. The semantic matching method based on computer recognizable natural language description according to claim 1, wherein the step S2) includes:

3. The computer recognizable natural language description based semantic matching method according to claim 2,

in step S21), the qualifier set for converting the requirement sentence into the target language uses a prefix as its identifier;

4. The semantic matching method based on computer recognizable natural language description as claimed in claim 1, wherein in step S4), the formula of the matching degree of the candidate word set and the standby word set is:

5. A semantic matching system based on computer recognizable natural language description is characterized by comprising a constraint module, a candidate word set forming module, a standby word set forming module and a matching module, wherein:

the matching module is used for calculating the matching degree of the candidate word set and the standby word set so as to replace corresponding sentences in the natural language requirement description with target languages meeting the matching degree, wherein the target languages are computer languages selected for programming;

the matching module comprises an intersection unit and a matching unit, wherein:

6. The computer recognizable natural language description based semantic matching system of claim 5, wherein the candidate word set composition module comprises a primary word set unit, an applicable word set unit, a synonym expansion unit, and a root restoration unit, wherein:

7. The computer recognizable natural language description based semantic matching system of claim 6,

in the primary word set unit, prefixes of qualifiers set by converting the requirement sentences into the target language are used as identifiers;

8. The semantic matching system based on computer recognizable natural language description as claimed in claim 5, wherein in the matching unit, the formula of the matching degree of the candidate word set and the standby word set is: