CN114974253A - Natural language interpretation method and device based on character image and storage medium

Info

Publication number
CN114974253A
Authority
CN
China
Prior art keywords
interaction
user
interactive behavior
voice
data
Prior art date
Legal status
Pending
Application number
CN202210553460.1A
Other languages
Chinese (zh)
Inventor
林皓
高曦
杨华
Current Assignee
Beijing VRV Software Corp Ltd
Original Assignee
Beijing VRV Software Corp Ltd
Priority date
Filing date
Publication date
Application filed by Beijing VRV Software Corp Ltd
Priority to CN202210553460.1A
Publication of CN114974253A
Legal status: Pending (Current)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L17/00: Speaker identification or verification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a natural language interpretation method, device and storage medium based on a character portrait. The natural language interpretation method based on the character portrait comprises the following steps: receiving first voice data; determining a first interaction scene to which the first voice data belongs; selecting a first interaction behavior interpretation model corresponding to the first interaction scene; performing voice recognition on the first voice data to obtain a first recognition result; and correcting first keywords meeting preset conditions in the first recognition result by using the first interaction behavior interpretation model. By determining the first interaction behavior interpretation model corresponding to the first interaction scene and using it to correct the first keyword in the first recognition result of the first voice data, the efficiency of voice recognition/language interpretation is improved, and the accuracy of voice recognition/language interpretation can also be improved.

Description

Natural language interpretation method and device based on character image and storage medium
Technical Field
The invention relates to the technical field of voice recognition, in particular to a natural language interpretation method and device based on character images and a storage medium.
Background
With the continuous development of computer technology, human-machine interaction has become increasingly diversified and intelligent. At present, more and more interaction platforms adopt voice interaction, which improves interaction efficiency and user engagement and has become an important mode of human-machine interaction. For example, a self-service voice customer service system poses a question to a user by voice, and the user then answers the question by voice. Other examples, such as navigation systems and shopping systems, require the user to issue voice commands to control what they present. In these scenarios, the user's speech must be recognized accurately in order to give correct feedback.
However, conventional speech recognition and language interpretation methods are not sufficiently accurate and cannot perform targeted interpretation and recognition based on a character portrait.
Disclosure of Invention
To address the above problems, the present invention provides a natural language interpretation method, device and storage medium based on a character image. A first interactive behavior interpretation model corresponding to a first interaction scene is determined, and a first keyword in a first recognition result of first voice data is corrected by using the first interactive behavior interpretation model, so that the efficiency of voice recognition/language interpretation is improved and the accuracy of voice recognition/language interpretation is also improved.
In view of the above, an aspect of the present invention provides a method for natural language interpretation based on a character image, including:
receiving first voice data;
determining a first interaction scene to which the first voice data belongs;
selecting a first interaction behavior interpretation model corresponding to the first interaction scenario;
performing voice recognition on the first voice data to obtain a first recognition result;
modifying the first key words meeting preset conditions in the first recognition result by using the first interactive behavior interpretation model;
the first interactive behavior interpretation model comprises an association relation between interaction scene information and character portrait information.
Optionally, after the step of determining the first interaction scenario to which the first voice data belongs, the method further includes:
prompting a user to make a first action and/or prompting the user to speak first text data and/or prompting the user to input first selection data according to the first interaction scene;
collecting the first action data and/or the first text data and/or the first selection data;
and extracting first key information from the first action data and/or the first text data and/or the first selection data.
Optionally, the step of selecting a first interaction behavior interpretation model corresponding to the first interaction scenario includes:
selecting an interactive behavior interpretation model matched with the first key information based on the corresponding relation between the pre-established key information and the interactive behavior interpretation model;
and determining the interaction behavior interpretation model matched with the first key information as a first interaction behavior interpretation model corresponding to the first interaction scene.
Optionally, the step of correcting, by using the first interactive behavior interpretation model, the first keyword that meets a preset condition in the first recognition result includes:
extracting a first keyword meeting a preset condition from the first recognition result;
extracting a first character portrait from the first interactive behavior interpretation model;
and correcting the first keyword according to the first character portrait.
Optionally, the step of determining the first interaction scenario to which the first voice data belongs includes:
extracting first attribute information from the first voice data;
and determining the first interaction scene according to the first attribute information.
Optionally, the first attribute information includes: the first voice data acquisition tool, the acquisition mode, the acquisition time, the acquisition place, the number of people and the semantic environment.
Optionally, after the step of correcting, by using the first interactive behavior interpretation model, the first keyword in the first recognition result, which meets a preset condition, the method further includes:
outputting a first correction result of the first keyword in a voice or text mode;
receiving evaluation feedback of the user on the correction result;
when the evaluation feedback is a positive value, increasing the priority of the first interaction behavior interpretation model corresponding to the first interaction scene;
and when the evaluation feedback is a negative value, reducing the priority of the first interactive behavior interpretation model corresponding to the first interactive scene.
Optionally, before the step of receiving the first voice data, the method further includes:
determining the relationship between a first user and a second user from a user group, and generating a first relationship label by using the respective unique identity labels of the first user and the second user;
acquiring first interaction behavior data between the first user and the second user;
constructing a first character portrait of the first user, a second character portrait of the second user and an interactive behavior database between the first user and the second user according to the first interactive behavior data and the first relation label;
repeating the above operations until all users have established character portraits according to their different roles and interactive behavior databases have been established between the different character portraits, and extracting key information from the interactive behavior databases;
inputting the data in the interactive behavior database into a trained neural network to obtain a plurality of interactive behavior interpretation models;
and establishing a corresponding relation between the key information and the plurality of interactive behavior interpretation models.
Another aspect of the present invention provides a natural language interpretation apparatus based on a character image, comprising: the device comprises a voice receiving module, a processing module, a voice recognition module and a result correction module;
the voice receiving module is used for receiving first voice data;
the processing module is used for determining a first interaction scene to which the first voice data belongs and selecting a first interaction behavior interpretation model corresponding to the first interaction scene;
the voice recognition module is used for performing voice recognition on the first voice data to obtain a first recognition result;
the result correction module is used for correcting the first keyword which meets the preset condition in the first recognition result by using the first interactive behavior interpretation model;
the first interactive behavior interpretation model comprises an association relation between interaction scene information and character role portrait information.
A third aspect of the invention provides a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the character-image-based natural language interpretation method described in any of the foregoing aspects.
By adopting the technical scheme of the invention, the natural language interpretation method based on the character image comprises the following steps: receiving first voice data; determining a first interaction scene to which the first voice data belongs; selecting a first interaction behavior interpretation model corresponding to the first interaction scenario; performing voice recognition on the first voice data to obtain a first recognition result; and correcting the first key words meeting preset conditions in the first recognition result by using the first interactive behavior interpretation model. By determining the first interactive behavior interpretation model corresponding to the first interactive scene and correcting the first keyword in the first recognition result of the first voice data by using the first interactive behavior interpretation model, the efficiency of voice recognition/language interpretation is improved, and the accuracy of voice recognition/language interpretation can also be improved.
Drawings
FIG. 1 is a flowchart of a method for natural language interpretation based on character images according to an embodiment of the present invention;
FIG. 2 is a flowchart of the steps performed, in another embodiment of the present invention, after the step of determining the first interaction scene to which the first voice data belongs;
FIG. 3 is a flowchart illustrating the detailed implementation of the step of selecting a first interaction behavior interpretation model corresponding to the first interaction scenario according to another embodiment of the present invention;
FIG. 4 is a flowchart illustrating a specific implementation of the step of modifying the first keyword satisfying the predetermined condition in the first recognition result by using the first interactive behavior interpretation model in another embodiment;
FIG. 5 is a flowchart of the steps performed after the first keyword satisfying the preset condition in the first recognition result is corrected using the first interactive behavior interpretation model in another embodiment;
FIG. 6 is a flowchart of a method for building an interactive behavior interpretation model in another embodiment;
fig. 7 is a schematic block diagram of a character image-based natural language interpretation apparatus according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention, taken in conjunction with the accompanying drawings and detailed description, is set forth below. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
A method, apparatus, and storage medium for natural language interpretation based on character images according to some embodiments of the present invention are described below with reference to fig. 1 to 7.
As shown in fig. 1, an embodiment of the present invention provides a method for natural language interpretation based on character images, including:
receiving first voice data;
determining a first interaction scene to which the first voice data belongs;
selecting a first interaction behavior interpretation model corresponding to the first interaction scenario;
performing voice recognition on the first voice data to obtain a first recognition result;
modifying the first key words meeting preset conditions in the first recognition result by using the first interactive behavior interpretation model;
the first interactive behavior interpretation model comprises an association relation between interaction scene information and character role portrait information.
It can be understood that the natural language interpretation method based on the character image provided by the embodiment of the invention can be applied to intelligent terminals, such as smart phones, computers, smart televisions and the like, and can also be applied to intercom equipment, robots, access control systems and the like.
In the embodiment of the present invention, the first voice data may be acquired through a voice acquisition unit (such as a microphone), or may be acquired from a server or an intelligent terminal through a communication network. In the process of collecting the first voice data, related information about the scene in which the speech occurs is saved at the same time as the first attribute information of the first voice data.
It should be noted that after the first voice data is received, the first interaction scene to which the first voice data belongs may be determined according to the first attribute information carried by the first voice data. For example, if the first attribute information of the first voice data includes an acquisition location, the building corresponding to the coordinates (such as a home, a company, a mall, and the like) may be determined from the acquired location coordinates; assuming that the collection location is a company, then in combination with other first attribute information, such as the collection time (e.g. 10 am on Monday) and the number of people (e.g. 5 people, which can be determined from voiceprint characteristics), the interaction scene to which the first voice data belongs may be "company meeting". Depending on the actual application scenario, the interaction scene may include, but is not limited to: family chatting, work discussions, shopping, friend parties, and the like.
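By way of a non-limiting sketch, this kind of rule-based scene determination could be realized along the following lines in Python; the attribute fields and the rules themselves are assumptions chosen for the example, not features of the embodiment.

```python
from dataclasses import dataclass

@dataclass
class FirstAttributeInfo:
    """Illustrative subset of the first attribute information."""
    place: str          # collection place, e.g. "company", "home", "mall"
    hour: int           # collection time, 0-23
    weekday: int        # 0 = Monday ... 6 = Sunday
    num_people: int     # estimated from voiceprint characteristics

def determine_interaction_scene(attrs: FirstAttributeInfo) -> str:
    """Map the first attribute information to a first interaction scene with simple rules."""
    if attrs.place == "company" and attrs.weekday < 5 and attrs.num_people >= 3:
        return "company meeting"
    if attrs.place == "home":
        return "family chatting"
    if attrs.place == "mall":
        return "shopping"
    return "general conversation"

# Example: 5 people at a company at 10 am on a Monday -> "company meeting"
print(determine_interaction_scene(FirstAttributeInfo("company", 10, 0, 5)))
```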
Further, a first interaction behavior interpretation model corresponding to the first interaction scenario is selected. The first interactive behavior interpretation model comprises an incidence relation between interactive scene information and character portrait information, so that the first recognition result can be specifically interpreted or corrected by determining the corresponding character portrait information.
In the embodiment of the present invention, voice recognition is performed on the first voice data. The voice recognition module may segment the first voice data according to different voiceprints, according to a preset duration, or according to a preset file size. The segmented voice segments are queued according to the time at which the speech occurred, and each voice segment is converted into corresponding text information in queue order by a voice recognition algorithm; the text information is then fused in chronological order and adjusted according to the context to obtain the first recognition result.
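A minimal sketch of this segment-then-recognize pipeline is given below; the `transcribe` callable stands in for whichever voice recognition algorithm is actually used, and the context-based adjustment step is omitted, so both are assumptions of the example.

```python
from typing import Callable, List, Tuple

# Each segment: (start time in seconds, raw audio bytes), produced by voiceprint-based,
# fixed-duration, or fixed-size segmentation of the first voice data.
Segment = Tuple[float, bytes]

def recognize_first_voice_data(segments: List[Segment],
                               transcribe: Callable[[bytes], str]) -> str:
    """Queue the segments by the time the speech occurred, convert each to text in
    queue order, then fuse the text chronologically into the first recognition result."""
    ordered = sorted(segments, key=lambda seg: seg[0])   # queue by time of occurrence
    texts = [transcribe(audio) for _, audio in ordered]
    return " ".join(t.strip() for t in texts if t.strip())
```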
The first interactive behavior interpretation model is then used to interpret/correct the first keywords (such as terms with local characteristics, industry jargon, professional terms, and the like) that meet a preset condition (such as an occurrence frequency and/or error-prone frequency within a preset range) in the first recognition result, so as to obtain a first interpretation/correction result; for example, the first keyword is interpreted/corrected by using the character portrait information contained in the first interactive behavior interpretation model in combination with the characteristics of the person.
By adopting the technical scheme of the embodiment, the natural language interpretation method based on the character image comprises the following steps: receiving first voice data; determining a first interaction scene to which the first voice data belongs; selecting a first interaction behavior interpretation model corresponding to the first interaction scenario; performing voice recognition on the first voice data to obtain a first recognition result; and correcting the first key words meeting preset conditions in the first recognition result by using the first interactive behavior interpretation model. By determining the first interactive behavior interpretation model corresponding to the first interactive scene and utilizing the first interactive behavior interpretation model to interpret/correct the first keyword in the first recognition result of the first voice data, the efficiency of voice recognition/language interpretation is improved, and meanwhile, the accuracy of voice recognition/language interpretation can also be improved.
As shown in fig. 2, in some possible embodiments of the present invention, after the step of determining the first interaction scenario to which the first voice data belongs, the method further includes:
prompting a user to make a first action and/or prompting the user to speak first text data and/or prompting the user to input first selection data according to the first interaction scene;
collecting the first action data and/or the first text data and/or the first selection data;
and extracting first key information from the first action data and/or the first text data and/or the first selection data.
It should be noted that, in order to further specify the interaction behavior characteristics of the user in the first interaction scenario to select the optimal interaction behavior interpretation model, in the embodiment of the present invention, some interaction events are constructed according to the first interaction scenario to obtain the interaction behavior of the user in the first interaction scenario, for example, issuing an instruction to prompt the user to make a first action, and/or providing a piece of first text data and issuing an instruction to prompt the user to speak the first text data, and/or providing some options and issuing an instruction to prompt the user to input the first selection data, and so on. And extracting first key information from the first action data and/or the first character data and/or the first selection data by collecting the first action data and/or the first character data and/or the first selection data. The first key information may be a specific action, a specific tone, a specific pronunciation, or a specific preference, and the like, which is not limited by the embodiment of the present invention.
It can be understood that based on the first interaction scenario, the more interaction events are constructed, the more comprehensive the event types are covered, the more interaction behavior data are obtained, and the more accurate the interaction behavior interpretation model is selected later.
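For illustration only, such interaction events and the extraction of first key information might be organized as follows; the prompt catalogue and the extracted fields are invented for the example.

```python
# Hypothetical catalogue of interaction events constructed per interaction scene.
SCENE_PROMPTS = {
    "company meeting": ["Please read aloud: 'the quarterly report is ready'"],
    "family chatting": ["Please wave your hand", "Choose one: tea / coffee / juice"],
}

def extract_first_key_information(responses: dict) -> dict:
    """Distill first key information (a specific action, tone, pronunciation or
    preference) from the collected action / text / selection data."""
    key_info = {}
    if "spoken_text" in responses:
        key_info["pronunciation_sample"] = responses["spoken_text"]
    if "action" in responses:
        key_info["characteristic_action"] = responses["action"]
    if "selection" in responses:
        key_info["preference"] = responses["selection"]
    return key_info
```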
In some possible embodiments of the invention, as shown in fig. 3, the step of selecting a first interaction behavior interpretation model corresponding to the first interaction scenario comprises:
selecting an interactive behavior interpretation model matched with the first key information based on the corresponding relation between the pre-established key information and the interactive behavior interpretation model;
and determining the interaction behavior interpretation model matched with the first key information as a first interaction behavior interpretation model corresponding to the first interaction scene.
It can be understood that, in the embodiment of the present invention, the correspondence between the key information and the interactive behavior interpretation model is established by analyzing the interactive behavior data in different historical interactive scenes and by processing through the neural network. As described above in explaining the first key information, the key information may be a specific action, a specific tone, a specific pronunciation, or a specific preference, etc. And selecting an interactive behavior interpretation model matched with the first key information based on the corresponding relation between the key information and the interactive behavior interpretation model, and taking the interactive behavior interpretation model as a first interactive behavior interpretation model corresponding to the first interactive scene. Through the scheme, the first interaction behavior interpretation model can be selected quickly and accurately, the execution efficiency is improved, and the user experience is improved.
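A sketch of querying the pre-established correspondence might look like the following; scoring the match by simple key-value overlap is an assumption of the example rather than the claimed matching rule.

```python
from typing import Any, Dict, List, Optional, Tuple

def select_first_interpretation_model(
    first_key_info: Dict[str, Any],
    correspondence: List[Tuple[Dict[str, Any], object]],  # (key information, model) pairs
) -> Optional[object]:
    """Return the interaction behavior interpretation model whose key information
    best matches the first key information extracted for the first interaction scene."""
    if not correspondence:
        return None

    def overlap(info: Dict[str, Any]) -> int:
        return sum(1 for k, v in info.items() if first_key_info.get(k) == v)

    _, best_model = max(correspondence, key=lambda pair: overlap(pair[0]))
    return best_model
```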
As shown in fig. 4, in some possible embodiments of the present invention, the step of correcting, by using the first interactive behavior interpretation model, the first keyword satisfying the preset condition in the first recognition result includes:
extracting a first keyword meeting a preset condition from the first recognition result;
extracting a first character portrait from the first interactive behavior interpretation model;
and correcting the first keyword according to the first character portrait.
It is understood that, in the embodiment of the present invention, the first recognition result may be text information, and a first keyword (e.g., a term with local characteristics, industry jargon, a professional term, etc.) satisfying a preset condition (e.g., an occurrence frequency and/or error-prone frequency within a preset range) is extracted from the first recognition result; a first character portrait is extracted from the first interactive behavior interpretation model, the first recognition result is comprehensively analyzed using the character labels contained in the first character portrait (such as habitual residence, local industry, accent characteristics, gender, character relationships, and the like), and the first keyword is interpreted/corrected when an error exists, so as to obtain a first interpretation/correction result. In this embodiment, the character portrait is used to analyze the first recognition result in a targeted manner and correct the first keyword, which greatly improves the recognition accuracy.
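The portrait-driven correction could, for example, be backed by correction tables keyed by character labels, as sketched below; the table contents are placeholders and the table lookup is an assumption of the example.

```python
# Hypothetical correction tables keyed by (character label, label value).
CORRECTION_TABLES = {
    ("accent", "dialect_a"): {"misheard term": "intended term"},
    ("industry", "finance"): {"pe ratio": "P/E ratio"},
}

def correct_first_keyword(keyword: str, portrait_labels: dict) -> str:
    """Correct a first keyword by consulting the tables selected by the character
    portrait labels (habitual residence, industry, accent characteristics, ...)."""
    for label, value in portrait_labels.items():
        table = CORRECTION_TABLES.get((label, value), {})
        if keyword in table:
            return table[keyword]
    return keyword  # no portrait-specific correction applies
```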
In some possible embodiments of the present invention, the step of determining the first interaction scenario to which the first voice data belongs includes:
extracting first attribute information from the first voice data;
and determining the first interaction scene according to the first attribute information.
It can be understood that, as described above, in the process of collecting the first voice data, the related information of the scene in which the speech occurs is simultaneously stored as the first attribute information of the first voice data. Specifically, the voice data and the first attribute information are packaged together to form the first voice data, or the data format of the voice data is modified and a field is added to record the first attribute information, thereby forming the first voice data.
Wherein the first attribute information includes: the first voice data collection tool (such as a mobile phone, an unmanned aerial vehicle, an office robot, an intelligent camera, and the like), the collection mode (such as direct collection by the device, or collection through other devices connected over a network), the collection time (such as 6 am, 9 am, 3 pm, 8 pm, and the like), the collection place (such as a company, a home, a mall, a hospital, a school, and the like), the number of people, and the semantic environment (mainly including the manner of expression, the preceding and following utterances, and the context).
In the embodiment of the invention, by recording the relevant information of the voice generation scene, one reference dimension is provided for subsequent voice recognition, and the voice recognition efficiency and accuracy are improved.
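One possible way to package the audio with its first attribute information is sketched below; the field names and the JSON-header container are assumptions of the example, not the claimed data format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class FirstVoiceData:
    """Raw audio packaged together with the first attribute information."""
    audio: bytes
    collection_tool: str     # e.g. "mobile phone", "office robot"
    collection_mode: str     # e.g. "direct", "via networked device"
    collection_time: str     # e.g. "2022-05-20T10:00:00"
    collection_place: str    # e.g. "company"
    num_people: int
    semantic_environment: str

def pack(data: FirstVoiceData) -> bytes:
    """Serialize the attribute part as a JSON header followed by the audio payload."""
    header = {k: v for k, v in asdict(data).items() if k != "audio"}
    return json.dumps(header).encode("utf-8") + b"\n" + data.audio
```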
As shown in fig. 5, in some possible embodiments of the present invention, after the step of correcting, by using the first interactive behavior interpretation model, the first keyword satisfying a preset condition in the first recognition result, the method further includes:
outputting a first correction result of the first keyword in a voice or text mode;
receiving evaluation feedback of the user on the correction result;
when the evaluation feedback is a positive value, increasing the priority of the first interaction behavior interpretation model corresponding to the first interaction scene;
and when the evaluation feedback is a negative value, reducing the priority of the first interactive behavior interpretation model corresponding to the first interactive scene.
It can be understood that, in order to improve the recognition efficiency, the embodiment of the present invention provides a feedback mechanism: the first interpretation/correction result of the first keyword is output in the form of voice or text, and an interactive interface is provided for the user to evaluate the recognition result; the user's evaluation feedback on the correction result is then received, and the priority of the first interaction behavior interpretation model corresponding to the first interaction scene is adjusted accordingly, depending on whether the feedback represents a positive or negative value.
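The priority adjustment driven by evaluation feedback might be as simple as the following sketch; the priority table and the step size of one unit are assumptions of the example.

```python
from collections import defaultdict

# priority[(interaction scene, model id)] -> score consulted when choosing among models
model_priority = defaultdict(float)

def apply_evaluation_feedback(scene: str, model_id: str, feedback: float) -> None:
    """Raise the model's priority for the scene on positive feedback, lower it on negative."""
    if feedback > 0:
        model_priority[(scene, model_id)] += 1.0
    elif feedback < 0:
        model_priority[(scene, model_id)] -= 1.0
```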
As shown in fig. 6, in some possible embodiments of the present invention, before the step of receiving the first voice data, the method further includes:
s1, determining the relationship between a first user and a second user from a user group, and generating a first relationship label by using the unique identity labels of the first user and the second user;
It is understood that each user has a unique identity tag. The relationship between users may be a role relationship such as parent, child, couple, friend, colleague, or another basic relationship. The role tag of a user may be constructed by a preset rule based on the user's unique identity tag, for example by appending a role field to the unique identity tag, and the first relationship label may be constructed by fusing the role tags of the first user and the second user. In this step, any two different users are randomly selected as the first user and the second user.
S2, acquiring first interaction behavior data between the first user and the second user;
In this step, the interactive behavior includes chatting, discussion, teaching, commanding, and the like, and the first interactive behavior data between the first user and the second user is extracted from the interactive behavior data among multiple users/characters (such as voice, motion, text, geographical location, distance between persons, number of simultaneous participants, background noise, and the like).
S3, constructing a first character portrait of the first user, a second character portrait of the second user and an interactive behavior database between the first user and the second user according to the first interactive behavior data and the first relation label;
In this step, a character portrait is established using wording, emotion, age, gender, education stage, accent, hobbies, and the like. An interactive behavior database between every two characters is then established based on the character portraits and the necessary technical means (keyword recognition, emotion recognition, attitude analysis, and the like); the interactive behavior database includes a relationship label (obtained when constructing the first relationship label) for each piece of interactive behavior.
S4, repeating the above operations until all users have established character portraits according to their different roles and interactive behavior databases have been established between the different character portraits, and extracting key information from the interactive behavior databases;
in this step, the key information, such as the first key information described above, may be a specific action, a specific tone, a specific pronunciation, or a specific preference.
S5, inputting the data in the interactive behavior database into a trained neural network to obtain a plurality of interactive behavior interpretation models;
and S6, establishing the corresponding relation between the key information and the plurality of interactive behavior interpretation models.
In this embodiment, the interaction behavior interpretation models are generated and constructed using a pre-trained neural network on the basis of the role relationships among the users and the analysis of the interaction behavior data. The method is simple to execute, and the generated interaction behavior interpretation models can provide accurate interpretation results.
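Steps S1 to S6 can be pictured with plain data structures as in the sketch below; the relationship-label format and the `train_model` callable (standing in for the trained neural network, whose architecture the embodiment does not specify) are assumptions of the example.

```python
from itertools import combinations

def build_relationship_label(uid_a: str, role_a: str, uid_b: str, role_b: str) -> str:
    """S1: fuse the role-tagged unique identities of two users into one relationship label."""
    return f"{uid_a}:{role_a}|{uid_b}:{role_b}"

def build_interpretation_models(users: dict, interactions: dict, train_model):
    """S2-S6 in outline. `users` maps user id -> portrait dict (containing a "role"),
    `interactions` maps (uid_a, uid_b) -> list of interaction records, and `train_model`
    turns a record list into an interaction behavior interpretation model."""
    behavior_db, key_info_to_model = {}, {}
    for uid_a, uid_b in combinations(sorted(users), 2):
        label = build_relationship_label(uid_a, users[uid_a]["role"],
                                         uid_b, users[uid_b]["role"])
        records = interactions.get((uid_a, uid_b), [])
        behavior_db[label] = records                      # S3/S4: interactive behavior database
        model = train_model(records)                      # S5: interpretation model per pair
        for record in records:                            # S6: key information -> model mapping
            if "key_info" in record:
                key_info_to_model[record["key_info"]] = model
    return behavior_db, key_info_to_model
```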
In some possible embodiments of the present invention, a verification mechanism is set, a plurality of corresponding interactive behavior interpretation models are set for a same interactive scene, and after a first keyword in a first recognition result of the first speech data is interpreted/corrected by the plurality of interactive behavior interpretation models, a user selects a most correct interpretation/correction result. Specifically, after the first interaction interpretation model is used to interpret/correct the first keyword satisfying the preset condition in the first recognition result to obtain a first interpretation/correction result, the method further includes:
selecting a second interaction behavior interpretation model corresponding to the first interaction scenario;
interpreting/correcting the first key words meeting preset conditions in the first recognition result by using the second interactive behavior interpretation model to obtain a second interpretation/correction result;
and displaying the first interpretation/correction result and the second interpretation/correction result to the user, letting the user select the optimal result, and adjusting the priorities of the first interactive behavior interpretation model and the second interactive behavior interpretation model according to the user's selection.
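This verification mechanism could be sketched as follows; the `correct` method on each candidate model and the `ask_user_to_pick` callback are assumed interfaces, and the one-unit priority step mirrors the feedback sketch above.

```python
def verify_with_multiple_models(keyword: str, scene: str, models: list,
                                priorities: dict, ask_user_to_pick) -> str:
    """Let each candidate interpretation model correct the first keyword, have the
    user pick the best result, then adjust the models' priorities for this scene."""
    results = [(model, model.correct(keyword)) for model in models]
    chosen = ask_user_to_pick([text for _, text in results])   # index chosen by the user
    for i, (model, _) in enumerate(results):
        key = (scene, id(model))
        priorities[key] = priorities.get(key, 0.0) + (1.0 if i == chosen else -1.0)
    return results[chosen][1]
```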
As shown in fig. 7, another embodiment of the present invention provides a character image-based natural language interpretation apparatus 700, including: a voice receiving module 701, a processing module 702, a voice recognition module 703 and a result correction module 704;
the voice receiving module 701 is configured to receive first voice data;
the processing module 702 is configured to determine a first interaction scenario to which the first voice data belongs, and select a first interaction behavior interpretation model corresponding to the first interaction scenario;
the voice recognition module 703 is configured to perform voice recognition on the first voice data to obtain a first recognition result;
the result modification module 704 is configured to modify, by using the first interactive behavior interpretation model, a first keyword that meets a preset condition in the first recognition result;
the first interactive behavior interpretation model comprises an association relation between interaction scene information and character role portrait information.
Please refer to the foregoing method embodiments for the operation method of the apparatus provided in this embodiment, which is not described herein again.
Fig. 7 is a hardware composition diagram of the apparatus in this embodiment. It will be appreciated that fig. 7 only shows a simplified design of the device. In practical applications, the apparatuses may also respectively include other necessary elements, including but not limited to any number of input/output systems, processors, controllers, memories, etc., and all apparatuses that can implement the natural language interpretation method of the embodiments of the present application are within the protection scope of the present application.
Another embodiment of the invention provides a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method described in any of the foregoing embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit may be stored in a computer-readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk or an optical disk, and other various media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications can be easily made by those skilled in the art without departing from the spirit and scope of the present invention, and it is within the scope of the present invention to include different functions, combination of implementation steps, software and hardware implementations.

Claims (10)

1. A natural language interpretation method based on character images is characterized by comprising the following steps:
receiving first voice data;
determining a first interaction scene to which the first voice data belongs;
selecting a first interaction behavior interpretation model corresponding to the first interaction scenario;
performing voice recognition on the first voice data to obtain a first recognition result;
modifying the first key words meeting preset conditions in the first recognition result by using the first interactive behavior interpretation model;
the first interactive behavior interpretation model comprises an association relation between interaction scene information and character role portrait information.
2. The character image-based natural language interpretation method of claim 1, further comprising, after the step of determining the first interaction scenario to which the first speech data belongs:
prompting a user to make a first action and/or prompting the user to speak first text data and/or prompting the user to input first selection data according to the first interaction scene;
collecting the first action data and/or the first text data and/or the first selection data;
and extracting first key information from the first action data and/or the first text data and/or the first selection data.
3. The character image-based natural language interpretation method of claim 2, wherein the step of selecting the first interaction behavior interpretation model corresponding to the first interaction scenario comprises:
selecting an interactive behavior interpretation model matched with the first key information based on the corresponding relation between the pre-established key information and the interactive behavior interpretation model;
and determining the interaction behavior interpretation model matched with the first key information as a first interaction behavior interpretation model corresponding to the first interaction scene.
4. The character image-based natural language interpretation method of claim 3, wherein the step of correcting the first keyword satisfying the preset condition in the first recognition result using the first interactive behavior interpretation model comprises:
extracting a first keyword meeting a preset condition from the first recognition result;
extracting a first character portrait from the first interactive behavior interpretation model;
and correcting the first keyword according to the first character portrait.
5. The character image-based natural language interpretation method of claim 4, wherein said step of determining a first interaction scenario to which said first speech data belongs comprises:
extracting first attribute information from the first voice data;
and determining the first interaction scene according to the first attribute information.
6. The character image-based natural language interpretation method according to claim 5, wherein the first attribute information includes: the first voice data acquisition tool, the acquisition mode, the acquisition time, the acquisition place, the number of people and the semantic environment.
7. The character image-based natural language interpretation method according to claim 6, further comprising, after said step of correcting the first keyword satisfying the predetermined condition in the first recognition result by using the first interactive behavior interpretation model:
outputting a first correction result of the first keyword in a voice or text mode;
receiving evaluation feedback of the user on the correction result;
when the evaluation feedback is a positive value, increasing the priority of the first interaction behavior interpretation model corresponding to the first interaction scene;
and when the evaluation feedback is a negative value, reducing the priority of the first interactive behavior interpretation model corresponding to the first interactive scene.
8. The character image-based natural language interpretation method of claim 7, further comprising, before said step of receiving first speech data:
determining the relationship between a first user and a second user from a user group, and generating a first relationship label by using the respective unique identity labels of the first user and the second user;
acquiring first interactive behavior data between the first user and the second user;
constructing a first character portrait of the first user, a second character portrait of the second user and an interactive behavior database between the first user and the second user according to the first interactive behavior data and the first relation label;
repeating the above operations until all users have established character portraits according to their different roles and interactive behavior databases have been established between the different character portraits, and extracting key information from the interactive behavior databases;
inputting the data in the interactive behavior database into a trained neural network to obtain a plurality of interactive behavior interpretation models;
and establishing a corresponding relation between the key information and the plurality of interactive behavior interpretation models.
9. A natural language interpretation apparatus based on a character image, comprising: the device comprises a voice receiving module, a processing module, a voice recognition module and a result correction module;
the voice receiving module is used for receiving first voice data;
the processing module is used for determining a first interaction scene to which the first voice data belongs and selecting a first interaction behavior interpretation model corresponding to the first interaction scene;
the voice recognition module is used for performing voice recognition on the first voice data to obtain a first recognition result;
the result correction module is used for correcting the first keyword which meets the preset condition in the first recognition result by using the first interactive behavior interpretation model;
the first interactive behavior interpretation model comprises an association relation between interaction scene information and character role portrait information.
10. A computer-readable storage medium, characterized in that,
the computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded and executed by a processor to implement the method for natural language interpretation based on a character image according to any one of claims 1 to 8.
CN202210553460.1A 2022-05-20 2022-05-20 Natural language interpretation method and device based on character image and storage medium Pending CN114974253A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210553460.1A CN114974253A (en) 2022-05-20 2022-05-20 Natural language interpretation method and device based on character image and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210553460.1A CN114974253A (en) 2022-05-20 2022-05-20 Natural language interpretation method and device based on character image and storage medium

Publications (1)

Publication Number Publication Date
CN114974253A 2022-08-30

Family

ID=82985152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210553460.1A Pending CN114974253A (en) 2022-05-20 2022-05-20 Natural language interpretation method and device based on character image and storage medium

Country Status (1)

Country Link
CN (1) CN114974253A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115617169A (en) * 2022-10-11 2023-01-17 深圳琪乐科技有限公司 Voice control robot and robot control method based on role relationship
CN116129926A (en) * 2023-04-19 2023-05-16 北京北信源软件股份有限公司 Natural language interaction information processing method for intelligent equipment
CN116129926B (en) * 2023-04-19 2023-06-09 北京北信源软件股份有限公司 Natural language interaction information processing method for intelligent equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination