CN107704453B

CN107704453B - Character semantic analysis method, character semantic analysis terminal and storage medium

Info

Publication number: CN107704453B
Application number: CN201710995052.0A
Authority: CN
Inventors: 胡明灯
Original assignee: Shenzhen Qianhai Zhongxing Scientific Research Co ltd
Current assignee: Shenzhen Yuanxing Internet Technology Co ltd
Priority date: 2017-10-23
Filing date: 2017-10-23
Publication date: 2021-10-08
Anticipated expiration: 2037-10-23
Also published as: CN107704453A

Abstract

The invention provides a character semantic analysis method, a character semantic analysis terminal and a storage medium. Character information input by a user is received, and the character strings contained in it are separated into independent words to obtain a word sequence. Syntactic analysis is carried out on the separated word sequence to judge whether it contains syntactic errors. The words contained in the word sequence are converted into corresponding metadata, the semantic similarity and feature item weight among the metadata are calculated, and the keyword feature items of the word sequence are extracted to obtain the semantic tagged texts corresponding to the words, which are stored in a text database. The semantic tagged texts are then matched from the text database in the order in which the words are arranged in the word sequence, and the sequenced and synthesized text information is output and displayed. Because the information is fed back to the user in the metadata format, the user can conveniently obtain the information fed back by the semantic analysis terminal and correctly understand and use it.

Description

Character semantic analysis method, character semantic analysis terminal and storage medium

Technical Field

The present invention relates to the technical field of semantic analysis, and in particular, to a text semantic analysis method, a text semantic analysis terminal, and a storage medium.

Background

At present, humans and machines interact through text conversation. Information acquisition and filtering fall short of the expected purpose, and the meaning of what the user says cannot be accurately identified. For example, given the input "can you go in the sea?", the machine may understand it as "the sea is not at home", whereas the user actually means "can we go to the sea to eat?". Although the conversation is in text form, the meanings a human can express are highly varied, and the semantic analysis methods used for text conversation have the following drawbacks:

First, what a user expresses is generally rich in distinctly human emotion, and with a simple text conversation semantic analysis method the machine cannot identify the meaning the user really intends. Second, even though the machine may recognize most of the user's meanings, the meaning the machine expresses in response may differ. Third, if the human-computer conversation is plain text and the data is not encrypted when it is sampled, analyzed and output, the security of the information cannot be guaranteed; it can easily be cracked and obtained by people with ulterior motives or by hackers, which is detrimental to the transmission of the data.

Accordingly, further improvements are needed in the art.

Disclosure of Invention

In view of the above technical problems, embodiments of the present invention provide a text semantic analysis method, a text semantic analysis terminal, and a storage medium, so as to solve the problem that existing human-computer conversation cannot identify the true meaning of the information expressed by a user, which leads to errors in information transfer.

A first aspect of an embodiment of the present invention provides a method for semantic analysis of characters, where the method for semantic analysis of characters includes the following steps:

receiving character information input by a user, performing lexical analysis on the input character information, and separating character strings contained in the character information into independent words to obtain a word sequence;

carrying out syntactic analysis on the separated word sequences, judging whether grammatical errors exist in the word sequences, and filtering out words with grammatical errors or phrases formed by adjacent words;

converting words contained in a word sequence into corresponding metadata, calculating semantic similarity and feature item weight among the metadata, extracting keyword feature items of the word sequence according to the calculated semantic similarity and feature item weight, obtaining semantic label texts corresponding to the words according to the keyword feature items, and storing the semantic label texts in a text database;

and matching corresponding semantic mark texts from the text database in sequence according to the arrangement sequence of each word in the word sequence, and outputting and displaying the text information synthesized after sequencing.

Optionally, the text information input by the user includes: identity information of the user and question information input by the user;

the identity information of the user comprises: user ID information byte, user name byte and mobile phone number byte.

Optionally, the step of separating the character string contained in the text information into independent words includes:

and using a blank space as a separator to separate the character string contained in the character information into independent words, and setting a unique corresponding number identification and a next metadata pointing identification for each word.
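The separation step above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `WordRecord` class, its field names and the use of list positions as number identifications are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordRecord:
    number: int                 # unique number identification of the word (assumed: its position)
    word: str                   # the separated word itself
    next_number: Optional[int]  # pointing identification of the next record, None for the last word


def split_into_records(text: str) -> List[WordRecord]:
    """Split the input on whitespace and link each word to its successor."""
    words = text.split()
    records = []
    for i, word in enumerate(words):
        next_number = i + 1 if i + 1 < len(words) else None
        records.append(WordRecord(i, word, next_number))
    return records
```

Storing the pointer to the next record alongside each word is what later allows independently processed fragments to be put back in their original order.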

Optionally, before receiving the text information input by the user, the method further includes:

creating a metadata base for storing metadata, and establishing an association relation between a word catalogue and the metadata contained in the metadata base;

in the step of converting the words contained in the word sequence into corresponding metadata, the metadata corresponding to the words is found out through the association relationship.

Optionally, the step of calculating semantic similarity and feature weight between metadata, and extracting the keyword feature of the word sequence according to the calculated semantic similarity and feature weight includes:

and calculating semantic similarity and feature item weight among metadata by adopting a word similarity analysis method based on a corpus and a word vector space model.

A second aspect of the embodiments of the present invention provides a text semantic analysis terminal, where the text semantic analysis terminal includes: a processor, a memory, and a word semantic analysis program stored on the memory and executable on the processor, wherein the word semantic analysis program when executed by the processor performs the steps of:

Optionally, when executed by the processor, the text semantic analysis program further implements the following steps:

A third aspect of the embodiments of the present invention provides a computer-readable storage medium, where a text semantic analysis program is stored on the computer-readable storage medium, and when executed by a processor, the text semantic analysis program implements the text semantic analysis method.

In the technical scheme provided by the embodiment of the invention, the information input by the user is stored in the form of metadata, the metadata can be properly analyzed and identified, and then the information is fed back to the user through the structural format of the metadata, so that the information irrelevant to the user is removed and only the information concerned by the user is pushed to the user when the information is fed back to the user, and the user can conveniently obtain the information fed back by the machine and correctly understand and use the information.

Drawings

FIG. 1 is a flow chart illustrating the steps of a text semantic analysis method according to the present invention;

FIG. 2 is a schematic block diagram illustrating the principle of the text semantic analysis method according to the present invention;

FIG. 3 is a flowchart of steps of a specific application embodiment of the text semantic analysis method according to the present invention;

FIG. 4 is a schematic structural block diagram of the text semantic analysis terminal according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In computer terminology, semantic analysis is a logical phase of the compilation process. Its task is to check the context-dependent properties of a structurally correct source program, chiefly its types. A structurally incorrect source program cannot enter this stage; a structurally correct program, on the other hand, may still be wrong in context or type, and such errors are reported only when the program is compiled. Semantic analysis examines whether a source program has semantic errors and collects type information for the code generation stage. One of its tasks is type checking: verifying that each operator has operands permitted by the language specification, and reporting an error when it does not. For example, some compilers report an error when a real number is used as an array index. As another example, some languages specify that operands may be coerced: when an operation is applied to an integer and a real operand, the compiler should convert the integer into a real, and this is not considered an error in the source program.

At present, communication between people relies mainly on spoken and written language as tools, and proceeds smoothly when the meaning expressed is correctly understood. Conversation between people and machines takes place in text form, yet a computer can only recognize the two numerical symbols '0' and '1', so human-computer conversation has to be conveyed through computer instructions. During transmission, data such as instructions is first fed into the computer through an input device, the processing result is stored in the computer, and the result is finally presented through an output device so that people can read or listen to it. In the course of data storage and transmission, however, a series of processing steps must be applied to the data to achieve smooth communication between people and machines. The metadata management approach adopted by the invention provides the guarantee and the implementation mechanism for this process.
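The two type-checking cases mentioned above (rejecting a real number as an array index, and coercing an integer to a real in a mixed operation) can be illustrated with a small sketch. The type names `"int"` and `"real"` and both function names are assumptions chosen for the example; no particular compiler is being described.

```python
NUMERIC_TYPES = ("int", "real")


def result_type(left: str, right: str) -> str:
    """Return the type of a binary arithmetic operation, coercing int to real."""
    if left not in NUMERIC_TYPES or right not in NUMERIC_TYPES:
        raise TypeError(f"operands {left!r} and {right!r} are not allowed")
    # A mixed int/real operation is legal: the integer is converted to a real.
    return "real" if "real" in (left, right) else "int"


def check_array_index(index_type: str) -> None:
    """Report an error when the index is not an integer."""
    if index_type != "int":
        raise TypeError(f"array index must be an integer, got {index_type!r}")
```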

Metadata is data that describes other data; it is in effect a coding scheme, commonly used to describe digital information resources and network information resources in particular. Metadata is structured data extracted from information resources (for example course name, speaker and duration) that is used to organize, retrieve, describe, store and manage information and knowledge resources and to explain their characteristics and content. Take the lecture information (an information resource) for a teacher's lecture in our online club: the club application lets us retrieve information such as course name: quality management; main speaker: teacher and Wei; lecture date: June 21, 2017. Because a basic metadata record consists of metadata items and metadata contents, describing resources with metadata allows them to be filtered and classified effectively; together with a standard metadata specification, this makes it possible to distinguish the valid content of a resource from the unusable content and to express the correct meaning of the information. After years of development, metadata formats support XML, HTML and similar formats, which makes it convenient for people to define their own tags, that is, the metadata. With such tags, a user can look at the tags (metadata) first when using the data in order to find the information required, and metadata can be extended through the use of attributes.
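As a rough illustration of metadata items and metadata contents, a record can be modelled as a mapping from item to content and filtered on any item. The dictionary layout, the function name `filter_by_item` and the second record are assumptions invented for the example; only the course name and date of the first record come from the text.

```python
from typing import Dict, List

Metadata = Dict[str, str]  # metadata item -> metadata content


def filter_by_item(records: List[Metadata], item: str, content: str) -> List[Metadata]:
    """Keep only the records whose given metadata item has the given content."""
    return [record for record in records if record.get(item) == content]


lectures = [
    {"course name": "quality management", "lecture date": "2017-06-21"},
    # Hypothetical second record, present only so the filter has something to exclude.
    {"course name": "project management", "lecture date": "2017-07-05"},
]
matches = filter_by_item(lectures, "course name", "quality management")
```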

The invention provides a semantic analysis method. As shown in FIG. 1, the analysis method comprises the following steps:

Step 101, receiving character information input by a user, performing lexical analysis on the input character information, and separating character strings contained in the character information into independent words to obtain a word sequence.

In this step, the text information sent by the user through the client is received first. In implementation, the user sends the text information through a client, such as an app installed on a mobile terminal, and the client forwards the received text information to the server.

Specifically, the text information input by the user includes: identity information of the user and question information input by the user;

It is conceivable that the user identity information is entered by the user each time information is sent; alternatively, the identity information may be stored in advance, and when the user needs to send information, the question information input by the user is packaged together with the pre-stored identity information and sent.

In this step, the step of separating the character string included in the character information into independent words includes:

Because the information input by the user consists entirely of characters, this step first performs lexical analysis on the input: the character string is divided in sequence according to the format of the words, the words in it are identified, and characters that cannot be recognized or combined into words are discarded.

Step 102, carrying out syntactic analysis on the separated word sequences, judging whether grammatical errors exist in the word sequences, and filtering out words with grammatical errors or phrases formed by adjacent words.

The separated word sequence is parsed to determine whether it contains word combinations that do not conform to the grammar. Code is generated by attaching attributes of the language structure to the non-terminal symbols that represent it; attribute values are computed by the semantic rules attached to the grammar productions, which realizes syntax-directed translation and the semantic translation of context-free grammars.

The method further comprises: analyzing and judging the assignment statements, arithmetic expressions and logical expressions in the word sequence, and filtering out phrases whose variable types are inconsistent.

Step 103, converting words contained in the word sequence into corresponding metadata, calculating semantic similarity and feature item weight between the metadata, extracting keyword feature items of the word sequence according to the calculated semantic similarity and feature item weight, obtaining semantic label texts corresponding to the words according to the keyword feature items, and storing the semantic label texts in a text database.

And converting each word into corresponding metadata, and performing semantic analysis on information input by a user by establishing a metadata model to obtain the intention of the information.

Before the step of receiving the text information input by the user, the method further comprises the following steps:

Specifically, semantic analysis of the text conversation and the user information is performed on the basis of metadata management. The semantic analysis obtains the key information of the question input by the user by calculating the semantic similarity and feature item weight between metadata, and builds the semantic tagged text of the question from that key information. In other words, semantic tagging of the text conversation is carried out through semantic analysis, and text documents with semantic tags are stored in a tagged text database (the metadata database).

Preferably, the step of calculating the semantic similarity and the feature weight between the metadata and extracting the keyword feature item of the word sequence according to the calculated semantic similarity and the feature item weight includes:

Step 104, matching corresponding semantic mark texts from the text database in sequence according to the arrangement sequence of each word in the word sequence, and outputting and displaying the text information synthesized after sequencing.

Because the obtained semantic tagged text documents corresponding to the word sequence are respectively independent information and are not combined into text information, in the step, the semantic tagged text documents of the independent information are sorted according to the unique corresponding serial number identifier of each word and the pointing identifier of the metadata corresponding to the next word, and the text information is synthesized and output. The text information is the correct expression of the user input question.
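The ordering and synthesis described in this step can be sketched as a walk along the pointing identifications. The data layout (a mapping from number identification to a pair of fragment text and next number) and the function name `synthesize` are assumptions for illustration, as is joining the fragments with spaces.

```python
from typing import Dict, Optional, Tuple

# number identification -> (semantic tagged text fragment, number of the next fragment)
Fragments = Dict[int, Tuple[str, Optional[int]]]


def synthesize(fragments: Fragments, start: int) -> str:
    """Follow the next-pointers from the first fragment and join the texts in order."""
    parts = []
    seen = set()
    current: Optional[int] = start
    while current is not None:
        if current in seen:
            raise ValueError("pointing identifications form a cycle")
        seen.add(current)
        text, next_number = fragments[current]
        parts.append(text)
        current = next_number
    return " ".join(parts)
```

Because the order is carried by the identifiers, the fragments can be retrieved from the database in any order and still be combined correctly.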

Fig. 2 is a schematic block diagram illustrating an interaction flow of a text conversation semantic analysis method based on metadata management according to an embodiment of the present invention, and for convenience of description, the method of the present invention is further explained with reference to fig. 3. The method of the embodiment of the invention comprises the following steps:

Step H1: after opening the client or the application on the mobile phone, the user inputs the relevant text information and sends a request to the terminal.

The request includes the identity information of the user and the question information input by the user.

After the user inputs information through the mobile phone application, the application stores the user information and the input information in a database; the application then issues a request to the machine, the content of which contains the user information and the input information. As a specific implementation, the input information includes a user ID information byte, a user name byte, a mobile phone number byte, a title byte, and a submission time byte.
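A request carrying the five fields listed above might be modelled as follows. The text does not specify a wire format, so the JSON encoding, the field names, the ISO 8601 timestamp and the sample values are all assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class UserRequest:
    user_id: str       # user ID information
    user_name: str     # user name
    phone_number: str  # mobile phone number
    title: str         # the question text entered by the user
    submitted_at: str  # submission time, assumed here to be an ISO 8601 string


def encode_request(request: UserRequest) -> str:
    """Serialize the request for transmission from the client to the server."""
    return json.dumps(asdict(request), ensure_ascii=False)


# Hypothetical sample values, not taken from the source.
payload = encode_request(
    UserRequest("u001", "Alice", "13800000000", "can we go to the sea to eat?", "2017-10-23T09:00:00")
)
```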

Step H2: the server terminal receives the request sent by the client and performs preliminary lexical analysis on the information input at the client.

When the server terminal receives the user input forwarded by the client, it passes the data to the background server. During data transmission, the server performs a preliminary preprocessing operation on the user's information and carries out lexical analysis.

Specifically, lexical analysis scans the user input from left to right, identifies the individual words according to the lexical rules of the language, and generates the attribute words of the corresponding words; that is, it converts the character sequence input by the user into a word (Token) sequence. The recognized words then undergo type determination and fixed-length processing.
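A left-to-right scan that turns a character sequence into a token sequence can be sketched with a regular expression. The three token kinds (`NUMBER`, `WORD`, `PUNCT`) and the rule that unmatched characters are silently skipped are assumptions for the example; the text does not list the actual lexical rules.

```python
import re
from typing import List, Tuple

# Each named group is one lexical rule; alternatives are tried in order at each position.
TOKEN_RE = re.compile(r"(?P<NUMBER>\d+)|(?P<WORD>[^\W\d_]+)|(?P<PUNCT>[?.!,])")


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Scan left to right and return (attribute, word) pairs; unmatched characters are dropped."""
    return [(match.lastgroup, match.group()) for match in TOKEN_RE.finditer(text)]
```

For instance, `tokenize("go at 6?")` yields two `WORD` tokens, one `NUMBER` token and one `PUNCT` token, each paired with the attribute that identifies its kind.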

Preprocessing the user input makes it possible to classify the words. Take the input "I am Chinese": the computer does not know that the words are separated by spaces; it only sees a character string composed of ordinary characters. The morphemes can be segmented from the input string by some method, here using spaces as separators. The segmented result can be expressed in XML as follows:

<sentence>

<word>I</word>

<word>am</word>

<word>Chinese</word>

</sentence>
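The space-separated segmentation and its XML form could be produced with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The function name `sentence_to_xml` is an assumption; the element names `sentence` and `word` follow the example in the text.

```python
import xml.etree.ElementTree as ET


def sentence_to_xml(text: str) -> str:
    """Split the input on spaces and wrap each word in a <word> element."""
    root = ET.Element("sentence")
    for word in text.split(" "):
        if word:  # skip empty strings produced by consecutive spaces
            ET.SubElement(root, "word").text = word
    return ET.tostring(root, encoding="unicode")
```

The serializer emits the elements on a single line without indentation; the content is the same as the multi-line layout used in the description.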

Step H3: grammar analysis is carried out on the word sequence obtained in step H2, and grammatical errors in the information are identified and filtered out.

Grammar analysis is likewise a logical stage of the compilation process. Building on lexical analysis, its task is to combine the word sequence into grammatical phrases, determine the structure of the word sequence and judge whether it is well formed; the structure is described by a context-free grammar.

Step H4: the words in the word sequence are converted into metadata, semantic analysis is performed on the metadata to obtain the semantic tagged text corresponding to the user input, and the semantic tagged text is stored in a text database.

After lexical analysis and syntactic analysis the information data is basically usable, but the problems of ambiguity and inconsistent understanding remain unsolved. At this point the data is classified and reorganized, converted into a metadata structure for storage, and the metadata is then managed systematically. Once the data has been converted into metadata, semantic analysis is carried out to obtain the user's true purpose and intention. That is, the word sequence is processed in turn by semantic expression, semantic organization, semantic storage and ambiguity elimination, and is thereby converted into the corresponding metadata sequence.

In a compiler, the source program first undergoes lexical analysis and syntactic analysis; semantic analysis is the third stage and the most substantive work of the compiler. Lexical analysis and syntactic analysis only recognize and process the form of the source program, whereas semantic analysis interprets its semantics and thereby brings about a qualitative change. Semantic analysis mainly comprises syntax-directed translation, the symbol table, type checking, the intermediate language, and the generation of intermediate code. In the invention, the machine performs semantic analysis when the background server obtains the data transmitted by the front end, and the data is encapsulated into a metadata model on which the semantic analysis operates. The semantic analysis module contains an ontology and an entity dictionary. The ontology is used to perform semantic analysis on the text; its basic units are concepts, the concepts form a concept tree, and the concept tree forms the ontology. Conceptualizing the text resolves word ambiguity. The entity dictionary is used to extract entities from the text, so that content without actual meaning is discarded and the computation required by subsequent text processing is reduced. Reasoning is performed through frame logic or description logic; the data in the information sources is collected, and the schema information of each local database is stored in the metadata database in a specified format. A global ontology for the corresponding domain is established by analyzing the semantic relations among the metadata, the text documents are semantically tagged through semantic analysis, and the tagged text documents are stored in a tagged text document database.

Specifically, semantic similarity measures the degree of similarity between two words. It is mainly used in fields such as word sense disambiguation, information retrieval, information extraction and machine translation, and it is strongly subjective, so it cannot be analyzed apart from a specific application environment. Two kinds of calculation method currently exist: one organizes the concepts of related words in a tree structure by means of a semantic dictionary and calculates on that tree; the other uses statistical methods over the contextual information of the words. For the application scenario of the invention, the semantic similarity and feature item weight calculations both use existing mature algorithms. A corpus-based word similarity analysis method is adopted, with the formula:

$$\mathrm{Sim}(W_1, W_2) = \frac{a}{\mathrm{Dis}(W_1, W_2) + a}$$

where Sim(W1, W2) is the similarity between the words W1 and W2, Dis(W1, W2) is the distance between them, and a is an adjustable parameter whose meaning is the value of the word distance at which the similarity equals 0.5. The weight of a feature item is calculated as w = tf × idf, where w is the weight of feature item t in document d, tf is the frequency with which t occurs in d, and idf is the inverse document frequency of t. The widely used word vector space model is also adopted, with the following steps: preprocessing -> text feature item selection -> weighting -> generating the vector space model and then calculating the cosine. The model selects a group of feature words in advance, calculates the relevance between this group of feature words and each word to obtain a feature word vector for each word, and uses the similarity between the vectors as the similarity between two words.
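The three calculations named above can be sketched in a few lines. The similarity formula is read here as Sim = a / (Dis + a), the form that matches the statement that a is the distance at which the similarity is 0.5; that reading, the logarithmic idf, the floor of 1 on the document count and all function names are assumptions of this sketch rather than details given in the text.

```python
import math
from collections import Counter
from typing import Dict, List


def distance_similarity(dis: float, a: float) -> float:
    """Sim = a / (Dis + a); equals 0.5 when the distance equals a."""
    return a / (dis + a)


def tf_idf(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Weight w = tf * idf for every feature item t in document d."""
    weights = {}
    for term, count in Counter(doc).items():
        tf = count / len(doc)
        # Number of documents containing the term, floored at 1 to avoid division by zero.
        df = max(1, sum(1 for other in corpus if term in other))
        weights[term] = tf * math.log(len(corpus) / df)
    return weights


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse feature vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A term that occurs in every document of the corpus receives a weight of zero under this idf, which is why common words contribute nothing to the keyword feature items.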

After metadata conversion and semantic analysis are carried out on user data, a machine generates corresponding correct answers according to data information and stores the correct answers in a database to serve as an information source of an output end.

Step H5: after semantic analysis of the user data, the machine organizes the data into an application knowledge base system according to the corresponding standards, in which the characteristics of each data item are clearly identified. After the user inputs information, the machine searches the knowledge base and selects the matching data with which to respond. That is, the semantic analysis results are stored in the semantic knowledge base; when the user inputs information, matching knowledge is retrieved from the knowledge base, and the required analysis result is then obtained through semantic association discovery.

Although the data has been converted into metadata and analyzed, and answers have been generated semantically on the metadata structure, the result cannot be output to the user side for display immediately, because at this point the information is not coherent and is in an isolated, scattered state. The data must be processed further by establishing relationships between the data items. Each metadata item has a unique identifier, comprising the number identification of the user input and a pointing identification of the next metadata item. Using these identifiers, once the user's input has been received, the question knowledge base is searched automatically for the corresponding answer texts, and the texts are combined to form the final result for the question input by the user. The machine then feeds the information synthesized from the whole text back to the user as its response, so as to meet the user's intent.

A second aspect of the embodiment of the present invention provides a text semantic analysis terminal. As shown in FIG. 4, the text semantic analysis terminal 10 includes: a processor 110, a memory 120, and a text semantic analysis program stored on the memory and executable on the processor, wherein the text semantic analysis program, when executed by the processor, performs the steps of:

Further, when executed by the processor 110, the text semantic analysis program further implements the following steps:

Preferably, when executed by the processor 110, the text semantic analysis program further implements the following steps:

creating a metadata base for storing metadata, and establishing an association relation between a word catalogue and the metadata contained in the metadata base; and the directories contained in the metadata database establish different hierarchies according to different metadata types, so that the corresponding metadata can be inquired more quickly according to the directories.
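One way to picture a catalogue that is layered by metadata type is a two-level mapping, sketched below. The structure, the function names `build_catalogue` and `lookup`, and the sample entries are assumptions for illustration; the text does not describe the storage layout of the metadata base.

```python
from typing import Dict, Iterable, Optional, Tuple

Metadata = Dict[str, str]
Catalogue = Dict[str, Dict[str, Metadata]]  # metadata type -> word -> metadata


def build_catalogue(entries: Iterable[Tuple[str, str, Metadata]]) -> Catalogue:
    """Associate each word with its metadata, grouped into one level per metadata type."""
    catalogue: Catalogue = {}
    for word, metadata_type, metadata in entries:
        catalogue.setdefault(metadata_type, {})[word] = metadata
    return catalogue


def lookup(catalogue: Catalogue, word: str, metadata_type: Optional[str] = None) -> Optional[Metadata]:
    """Find the metadata for a word, searching a single level when the type is known."""
    if metadata_type is not None:
        return catalogue.get(metadata_type, {}).get(word)
    for level in catalogue.values():
        if word in level:
            return level[word]
    return None
```

Knowing the metadata type narrows the search to one level of the catalogue, which is the sense in which the hierarchy makes queries faster.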

Memory 120, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The processor 110 executes various functional applications of the server and data processing by running the nonvolatile software programs, instructions and modules stored in the memory 120, that is, implements the text semantic analysis method of the above method embodiment.

The memory 120 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the report automatic generation system, and the like. Further, the memory 120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 120 optionally includes memory located remotely from processor 110, which may be connected to a text semantic analysis terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more modules are stored in the memory 120 and, when executed by the one or more processors 110, perform the text semantic analysis method of any of the method embodiments described above.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

In the invention, when a user needs to acquire information resources, the user sends a corresponding command to the machine, and the machine receives the command and stores the command information. The data is stored in the metadata format; once the user's information resources are stored as metadata, the metadata can be properly analyzed and identified, and the information is then fed back to the user through the structural format of the metadata. When the information is fed back, information irrelevant to the user is removed and only the information the user is concerned with is pushed, so that the user can conveniently obtain the information fed back by the semantic analysis terminal and correctly understand and use it.

It should be understood that those skilled in the art may make equivalent replacements of, or changes to, the technical solutions and concepts of the present invention, and all such changes or replacements should fall within the protection scope of the appended claims.

Claims

1. A text semantic analysis method, characterized by comprising the following steps:

matching corresponding semantic mark texts from the text database in sequence according to the arrangement sequence of each word in the word sequence, and outputting and displaying the text information synthesized after sequencing;

wherein the semantic similarity and the feature item weight are calculated with existing, mature algorithms: a corpus-based word similarity analysis method is adopted, with the formula:

Sim(W1, W2) = a / (Dis(W1, W2) + a);

wherein Sim(W1, W2) is the similarity between the words W1 and W2, and a is an adjustable parameter whose meaning is the value of the word distance Dis(W1, W2) at which the similarity is 0.5; the feature item weight is calculated as w = tf × idf, wherein w is the weight of feature item t in document d, tf is the frequency with which t occurs in d, and idf is the inverse document frequency of t; this method is applied through the widely used word vector space model, which proceeds as follows: preprocessing -> text feature item selection -> weighting -> generation of the vector space model -> cosine calculation; the model selects a group of feature words in advance, calculates the relevance between this group of feature words and each word to obtain a relevance feature vector for each word, and takes the similarity between these vectors as the similarity between two words.
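
By way of illustration only, the calculations named in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function names, the sample documents, the normalisation of tf by document length, and the logarithmic form of idf are all assumptions introduced for the example, and the similarity formula is read as the parameter a divided by the sum of the distance and a, which is the reading under which the similarity equals 0.5 when the distance equals a.

```python
import math
from collections import Counter


def similarity_from_distance(dis: float, a: float) -> float:
    """Map a word distance to a similarity: a / (dis + a).

    Returns 0.5 when dis == a and 1.0 when dis == 0.
    """
    return a / (dis + a)


def tf_idf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight every feature item t of every document d as w = tf * idf.

    Assumed variants: tf = count / document length,
    idf = log(number of documents / documents containing t).
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already segmented word sequences used only for the example.
docs = [
    ["weather", "today", "rain"],
    ["weather", "tomorrow", "sunny"],
    ["stock", "price", "today"],
]
vecs = tf_idf_vectors(docs)
print(similarity_from_distance(2.0, a=2.0))
print(cosine(vecs[0], vecs[1]))
print(cosine(vecs[1], vecs[2]))
```

In this sketch the first two sample documents share the feature item "weather" and therefore have a positive cosine, while the second and third share no feature item and have a cosine of zero. Other tf and idf variants (for example, smoothed idf) would change the weights but not the overall flow of preprocessing, feature item selection, weighting, and cosine calculation.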

2. The text semantic analysis method according to claim 1, wherein the text information input by the user comprises: identity information of the user and question information input by the user;

3. The text semantic analysis method according to claim 2, wherein the step of separating the character strings contained in the text information into independent words comprises:

4. The text semantic analysis method according to claim 3, wherein before receiving text information input by a user, the text semantic analysis method further comprises the steps of:

5. The text semantic analysis method according to claim 4, wherein the step of calculating the semantic similarity and feature item weight among the metadata and extracting the keyword feature items of the word sequence according to the calculated semantic similarity and feature item weight comprises:

6. A text semantic analysis terminal, characterized by comprising: a processor, a memory, and a text semantic analysis program stored on the memory and executable on the processor, wherein the text semantic analysis program, when executed by the processor, performs the following steps:

Sim(W1, W2) = a / (Dis(W1, W2) + a);

7. The text semantic analysis terminal according to claim 6, wherein the text semantic analysis program further implements the following steps when executed by the processor:

8. The text semantic analysis terminal according to claim 7, wherein the text semantic analysis program further implements the following steps when executed by the processor:

9. The text semantic analysis terminal according to claim 7, wherein the text semantic analysis program further implements the following steps when executed by the processor:

10. A computer-readable storage medium, wherein a text semantic analysis program is stored on the computer-readable storage medium, and when executed by a processor, implements the text semantic analysis method according to any one of claims 1 to 5.