CN114925206A - Artificial intelligence body, voice information recognition method, storage medium and program product - Google Patents


Info

Publication number
CN114925206A
CN114925206A CN202111561200.0A
Authority
CN
China
Prior art keywords
data
entity
information
semantic
artificial intelligence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111561200.0A
Other languages
Chinese (zh)
Inventor
左珑
喻祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Pudu Technology Co Ltd
Original Assignee
Shenzhen Pudu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Pudu Technology Co Ltd filed Critical Shenzhen Pudu Technology Co Ltd
Priority to CN202111561200.0A priority Critical patent/CN114925206A/en
Publication of CN114925206A publication Critical patent/CN114925206A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to an artificial intelligence agent, a voice information recognition method, a storage medium, and a program product. The artificial intelligence agent comprises a memory storing a computer program and a processor that, when calling and executing the computer program, implements the following steps: in response to target voice information, recognizing semantic information of the target voice information through a preset semantic recognition model; and determining response information according to the target voice information and the semantic information, and outputting the response information. The semantic recognition model is trained on large-sample labeled data, which is obtained by applying data expansion processing to small-sample labeled data through a pre-constructed semantic knowledge graph; the semantic knowledge graph comprises corpora from various human-computer interaction scenarios. The semantic recognition model can therefore accurately recognize the semantic information of the target voice information and output response information that better matches the user's interaction expectations, improving the human-computer interaction effect.

Description

Artificial intelligence body, voice information recognition method, storage medium and program product
Technical Field
The present application relates to the field of natural language processing technologies, and in particular to an artificial intelligence agent, a voice information recognition method, a storage medium, and a program product.
Background
With the rapid development of science and technology, artificial intelligence is increasingly applied in daily life, bringing people greater convenience.
In the related art, smart devices such as in-vehicle systems, voice robots, and mobile phones mostly support recognizing and executing voice instructions, so that a user can interact with the device through a voice dialogue system.
However, in the related art, the accuracy of voice information recognition during a voice dialogue is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an artificial intelligence agent, a voice information recognition method, a storage medium, and a program product capable of improving the accuracy of voice information recognition.
In a first aspect, the present application provides an artificial intelligence agent. The artificial intelligence agent comprises a memory and a processor, wherein the memory stores a computer program, and the processor, when calling and executing the computer program, implements the following steps:
in response to target voice information, recognizing semantic information of the target voice information through a preset semantic recognition model;
wherein the semantic recognition model is trained on large-sample labeled data, and the large-sample labeled data is obtained by applying data expansion processing to small-sample labeled data through a pre-constructed semantic knowledge graph; the semantic knowledge graph comprises corpora from various human-computer interaction scenarios; and
determining response information according to the target voice information and the semantic information, and outputting the response information.
In one embodiment, the processor is further configured to invoke and execute the computer program and to perform the following steps:
obtaining corpora in various human-computer interaction scenarios;
extracting corresponding triple information from the corpora in each human-computer interaction scenario, wherein the triple information comprises entities, relations, and attributes; and
constructing a semantic knowledge graph according to the triple information.
In one embodiment, the data expansion processing comprises data transformation processing and data addition-and-deletion processing;
the processor, when calling and executing the computer program, further implements the following steps:
acquiring small-sample labeled data in each human-computer interaction scenario;
performing data transformation processing on the small-sample labeled data to obtain expanded sample data; and
performing data addition-and-deletion processing on the expanded sample data to obtain large-sample labeled data.
In one embodiment, performing data transformation processing on the small-sample labeled data to obtain expanded sample data comprises:
determining a target entity in each piece of small-sample labeled data and the entity attribute corresponding to the target entity, wherein the target entity comprises at least one entity;
acquiring entities of the same type as the target entity from the semantic knowledge graph according to the entity attribute corresponding to the target entity; and
replacing the target entity in the small-sample labeled data with the same-type entities to obtain the expanded sample data.
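The transformation claimed above can be sketched as follows. This is a minimal illustration, assuming a toy knowledge graph that maps each entity to a single type attribute; the entity names and graph structure are placeholders, not the patent's actual data.

```python
# Sketch of the claimed data transformation: replace a target entity with
# same-type entities drawn from the knowledge graph.
KG = {
    "Shanghai": "city", "Beijing": "city", "Shenzhen": "city",
    "today": "date", "tomorrow": "date",
}

def same_type_entities(target, kg):
    """Entities sharing the target's attribute (type), excluding the target."""
    attr = kg[target]
    return [e for e, a in kg.items() if a == attr and e != target]

def augment(sample, target, kg):
    """One expanded sample per same-type entity substituted for the target."""
    return [sample.replace(target, e) for e in same_type_entities(target, kg)]

expanded = augment("how is the weather in Shanghai today", "Shanghai", KG)
```

Each labeled sentence thus yields one new sample per same-type entity in the graph, which is how a small labeled set grows into a large one.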
In one embodiment, replacing the target entity in the small-sample labeled data with the same-type entities comprises:
acquiring position information of the target entity in the small-sample labeled data; and
replacing, according to the position information, the target entity in the small-sample labeled data with the same-type entities.
In one embodiment, performing data addition-and-deletion processing on the expanded sample data to obtain the large-sample labeled data comprises:
randomly adding a prefix and/or a suffix to the expanded sample data based on a preset prefix-and-suffix database to obtain first expanded data; and
performing filler-word addition and deletion operations on the first expanded data to obtain the large-sample labeled data.
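The two addition-and-deletion steps can be sketched as below. The prefix/suffix database and filler-word list are assumptions for illustration; the patent does not specify their contents.

```python
import random

# Hypothetical prefix/suffix database and filler-word list.
PREFIXES = ["hello,", "please,"]
SUFFIXES = ["thanks", "okay?"]
FILLERS = ["um", "well"]

def add_affixes(sentence, rng):
    """Randomly prepend a prefix and/or append a suffix (first expanded data)."""
    if rng.random() < 0.5:
        sentence = rng.choice(PREFIXES) + " " + sentence
    if rng.random() < 0.5:
        sentence = sentence + " " + rng.choice(SUFFIXES)
    return sentence

def add_delete_fillers(sentence, rng):
    """Randomly insert a filler word, or strip any fillers already present."""
    words = sentence.split()
    if rng.random() < 0.5:
        words.insert(rng.randrange(len(words) + 1), rng.choice(FILLERS))
    else:
        words = [w for w in words if w not in FILLERS]
    return " ".join(words)

rng = random.Random(0)
sample = add_delete_fillers(add_affixes("play a song", rng), rng)
```

Running both steps over every expanded sample multiplies the surface variety of the training sentences without changing their labeled semantics.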
In one embodiment, determining the response information according to the target voice information and the semantic information comprises:
determining a reference entity in the target voice information and the attribute of the reference entity;
acquiring at least one candidate entity corresponding to the reference entity from the semantic knowledge graph according to the attribute of the reference entity;
determining, among the at least one candidate entity, the entity with the highest similarity to the reference entity as a standard entity; and
determining the response information according to the standard entity and the semantic information of the target voice information.
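The candidate-matching step above can be sketched with a simple string-similarity measure. The candidate lookup table and titles are placeholders; the patent does not fix a particular similarity metric, so `difflib` is used here purely as an example.

```python
from difflib import SequenceMatcher

# Hypothetical candidate lookup keyed by the reference entity's attribute.
CANDIDATES_BY_TYPE = {
    "movie": ["Inception", "Interstellar", "Intouchables"],
}

def standard_entity(mention, mention_type):
    """Among the graph's candidates, pick the one most similar to the
    (possibly misheard) reference entity in the user's utterance."""
    candidates = CANDIDATES_BY_TYPE[mention_type]
    return max(candidates,
               key=lambda c: SequenceMatcher(None, mention.lower(), c.lower()).ratio())

best = standard_entity("inceptoin", "movie")  # noisy mention is normalized
```

This lets a noisy or misrecognized mention be mapped to the canonical entity stored in the knowledge graph before the response is looked up.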
In one embodiment, the processor is further configured to invoke and execute the computer program and to perform the following steps:
updating, at a preset period, the semantic knowledge graph according to the corpora in each human-computer interaction scenario at the current time.
In a second aspect, the present application further provides a voice information recognition method. The method comprises the steps carried out by the artificial intelligence agent of any one of the first aspects above.
In a third aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, performs the steps performed by the artificial intelligence agent of any one of the first aspects above.
In a fourth aspect, the present application further provides a computer program product. The computer program product comprises a computer program that, when executed by a processor, performs the steps performed by the artificial intelligence agent of any one of the first aspects above.
According to the artificial intelligence agent, the voice information recognition method, the storage medium, and the program product, semantic information of target voice information is recognized through a preset semantic recognition model in response to the target voice information; response information is then determined according to the target voice information and the semantic information, and output. The semantic recognition model is trained on large-sample labeled data obtained by applying data expansion processing to small-sample labeled data through a pre-constructed semantic knowledge graph, where the semantic knowledge graph comprises corpora from various human-computer interaction scenarios. Because the semantic knowledge graph is built from corpora in these scenarios and contains a large number of entities and entity relations from language texts generated during human-computer interaction, it can provide data support for training the semantic recognition model. Furthermore, because the small-sample labeled data consists of manually labeled sentence patterns, after a large amount of corpora is obtained from the semantic knowledge graph, those labeled sentence patterns can serve as templates for automatic labeling to produce the large-sample labeled data. This reduces the cost of manually labeling massive data, increases labeling speed, and improves the reliability of the labeled data. In addition, training the semantic recognition model on the expanded large-sample labeled data allows the model to learn from more interaction corpora, improving the training effect.
Therefore, in a human-machine voice dialogue scenario, after the artificial intelligence agent receives target voice information, it recognizes and analyzes the information with the trained semantic recognition model, accurately determining the semantic information and improving recognition accuracy. Further, based on accurate semantic information and the target voice information, the artificial intelligence agent can output response information that better matches interaction expectations, improving the human-computer interaction effect.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a method for speech information recognition;
FIG. 2 is a flow diagram illustrating a method for speech information recognition according to one embodiment;
FIG. 3 is a schematic flow diagram of a semantic knowledge graph construction method in one embodiment;
FIG. 4 is a partial schematic diagram of a semantic knowledge graph in one embodiment;
FIG. 5 is a flow diagram that illustrates a method for data enhancement, according to one embodiment;
FIG. 6 is a diagram of a data transformation process in one embodiment;
FIG. 7 is a diagram of a data transformation process in another embodiment;
FIG. 8 is a flow chart illustrating a method of data enhancement in another embodiment;
FIG. 9 is a flowchart illustrating a method for recognizing speech information according to another embodiment;
FIG. 10 is a block diagram showing the structure of a speech information recognition apparatus according to an embodiment;
FIG. 11 is a diagram illustrating the internal structure of an artificial intelligence in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
With the development of science and technology, artificial intelligence is increasingly applied in daily life, bringing greater convenience. For example, an artificial intelligence agent may interact with a user through a dialogue system: responding to the user's voice information, outputting corresponding response information, and/or performing corresponding actions.
A dialogue system is a computer system capable of coherent interaction with humans. An artificial intelligence agent equipped with a dialogue system may be, but is not limited to: a voice assistant, which frees the user's hands and performs actions such as booking tickets or ordering meals; an intelligent customer service agent, which replaces human customer service in voice dialogues with users; or an intelligent terminal device such as a smart speaker or smart projector. The artificial intelligence agent may also exist as a standalone voice interaction robot that interacts with humans by voice in practical application scenarios and provides corresponding services; the present application does not limit this.
In a dialogue system, the artificial intelligence agent needs to recognize the user's voice information and its semantic information, then query the database for the response information the user needs according to the semantic information and the entities in the voice information, and output that response information to the user.
In the related art, the artificial intelligence agent recognizes and analyzes the user's voice information through a semantic recognition model. The recognition performance of the model depends on the quality and quantity of its training data, so improving it requires training on a large amount of labeled data.
However, in some fields or human-computer interaction scenarios, little voice information is generated during interaction and it is difficult to acquire, leaving scarce training data for the semantic recognition model. Moreover, training data is labeled manually in advance, which is costly; and to minimize human error in labeling, annotators need a certain level of expertise, making labeling even more difficult.
In addition, even a semantic recognition model trained on massive labeled sample data has limited generalization ability: it cannot draw on broad prior knowledge and logical reasoning as humans do, and may make semantic recognition errors and/or entity extraction errors when recognizing and analyzing voice information. Some of these errors stem from mistakes in the labeled data; others arise because the labeled data, however massive, is still limited, and as society develops new vocabulary continually emerges, so the semantic recognition model must be retrained on newly labeled data.
On this basis, the present application provides an artificial intelligence agent, a voice information recognition method, a storage medium, and a program product, so that when the artificial intelligence agent holds a voice dialogue with a user, it recognizes and analyzes the user's target voice information through a semantic recognition model, accurately determines the semantic information of the target voice information, and improves semantic recognition accuracy.
The voice information recognition method can be applied to artificial intelligence agents. The artificial intelligence agent may be any agent that realizes human-machine voice dialogue in a real environment; it may be an intelligent component within a terminal, such as an intelligent customer service agent or a voice assistant, or a standalone terminal, such as a smart speaker or a voice interaction robot.
As one example, the internal structure of the artificial intelligence agent is shown in FIG. 1, in which a processor outputs response information in response to received voice information. The memory comprises a nonvolatile storage medium and an internal memory: the nonvolatile storage medium stores an operating system, a computer program, and a database, while the internal memory provides the environment in which the operating system and computer program on the nonvolatile storage medium run. The database stores data such as corpora from various human-computer interaction scenarios. The network interface communicates with external terminals over a network connection. The computer program, when executed by the processor, implements the voice information recognition method provided by the present application.
The following describes in detail, with reference to the drawings, the technical solutions of the embodiments of the present application and how they solve the above technical problems. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. It should be noted that the execution subject of the voice information recognition method provided in the embodiments of the present application may be an artificial intelligence agent, or a voice information recognition apparatus implemented as part or all of a processor by software, hardware, or a combination of the two. It should be understood that the described embodiments are only some, not all, of the embodiments of the present application.
In one embodiment, the present application provides an artificial intelligence agent. The artificial intelligence agent comprises a memory storing a computer program and a processor. As shown in FIG. 2, the processor calls and executes the computer program to implement the following steps:
Step 210: in response to target voice information, recognizing the semantic information of the target voice information through a preset semantic recognition model.
The semantic recognition model is trained on large-sample labeled data, which is obtained by applying data expansion processing to small-sample labeled data through a pre-constructed semantic knowledge graph; the semantic knowledge graph comprises corpora from various human-computer interaction scenarios.
It should be noted that the target voice information may be the content of any utterance any user addresses to the artificial intelligence agent, and the semantic information is the meaning of the target voice information, reflecting the user's real dialogue intention.
In one possible implementation, step 210 may proceed as follows: the artificial intelligence agent receives the target voice information, performs entity recognition on it through the semantic recognition model, and analyzes the semantic information of the target voice information according to the relations between the recognized entities.
As an example, if the target voice information is "how is the weather in Shanghai today", the entities in it are "today", "Shanghai", and "weather", and the semantic information of the target voice information is then determined from the attributes of and relations between these entities as: query the weather.
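The weather example can be illustrated with the toy parse below. This is only a keyword lookup standing in for the trained semantic recognition model; the entity table and intent name are assumptions for illustration.

```python
# Toy illustration of the example above: known entities are tagged with
# their types and a coarse intent is read off them. A real agent would use
# the trained semantic recognition model instead of this keyword lookup.
ENTITY_TYPES = {"today": "date", "Shanghai": "place", "weather": "topic"}

def parse(utterance):
    """Return the recognized entities and an intent inferred from them."""
    found = {w: ENTITY_TYPES[w] for w in utterance.split() if w in ENTITY_TYPES}
    intent = "query_weather" if found.get("weather") == "topic" else "unknown"
    return found, intent

entities, intent = parse("how is the weather in Shanghai today")
```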
In addition, before step 210, the semantic recognition model may be obtained as follows: acquiring manually labeled small-sample labeled data in various human-computer interaction scenarios; performing data expansion processing on the small-sample labeled data through the constructed semantic knowledge graph to obtain massive large-sample labeled data; and training an initial semantic recognition model on the large-sample labeled data to obtain the trained semantic recognition model.
The large-sample labeled data comprises multiple groups of labeled data, each group comprising a voice text and the semantic information of that voice text.
Further, training the initial semantic recognition model on the large-sample labeled data may proceed as follows: the groups of labeled data are fed in turn as input to the initial semantic recognition model, which is trained until the semantic information it outputs meets a preset convergence condition; the model is then determined to have converged, yielding the semantic recognition model.
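The training loop just described can be sketched abstractly. The dummy model step, whose error simply halves on every update, is a placeholder for one forward/backward pass of the actual semantic recognition model; the convergence condition (mean error below a tolerance) is one concrete reading of the "preset convergence condition".

```python
# Abstract sketch of the training loop: iterate until the mean error over
# the labeled groups satisfies the convergence condition.
def train(model_step, labeled_groups, tolerance, max_epochs=100):
    for epoch in range(max_epochs):
        errors = [model_step(text, label) for text, label in labeled_groups]
        if sum(errors) / len(errors) < tolerance:
            return epoch  # converged
    return max_epochs

state = {"err": 1.0}
def dummy_step(text, label):
    state["err"] *= 0.5  # error halves on every update
    return state["err"]

epochs = train(dummy_step, [("play a song", "play_music")] * 4, tolerance=0.01)
```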
In addition, to verify the model's learning effect, the large-sample labeled data may further include voice texts without labeled semantic information, which are used during training to validate the learning effect of the initial semantic recognition model.
It should be noted that the initial semantic recognition model may be trained iteratively in either a supervised or an unsupervised manner; the embodiments of the present application do not limit this.
As an example, supervised training may proceed as follows: the groups of labeled data are used as input to the initial semantic recognition model, which is trained iteratively until the error between the semantic information it recognizes and the labeled semantic information of each voice text is below a preset value, satisfying the preset convergence condition; the model is then determined to have converged, yielding the semantic recognition model.
Step 220: determining response information according to the target voice information and the semantic information, and outputting the response information.
The response information may be output as voice, as text, or as a chart; the form actually used may be chosen according to the situation or the user's interaction needs, which the present application does not limit.
In one possible implementation, the artificial intelligence agent determines from the semantic information the user's real intention in addressing the target voice information to it, then retrieves the corresponding response content from the database based on the semantic information and the target voice information, and determines the final response information from that content.
As an example, when the target voice information is "how is the weather in Shanghai today", the semantic information is: query the weather. The artificial intelligence agent determines the date from the entity "today" and the location from the entity "Shanghai", and then retrieves today's weather conditions for Shanghai from the database according to the semantic information.
The weather conditions may cover the whole day or the current moment in real time, and include, but are not limited to: cloud cover, rain and snow, temperature, humidity, wind direction, wind force, a warmth index, ultraviolet intensity, visibility, and the probability of severe weather.
Furthermore, today's weather conditions for Shanghai may be broadcast by voice, displayed visually as a chart on the display interface of the artificial intelligence agent, or output directly as descriptive text.
It should be understood that the above is only an example dialogue scenario in which weather conditions are queried through the artificial intelligence agent, and does not limit the specific application scenarios of the above voice information recognition method.
In the embodiments of the present application, the semantic knowledge graph is constructed from corpora in various human-computer interaction scenarios and contains a large number of entities and entity relations from language texts generated during human-computer interaction, so it can provide data support for training the semantic recognition model. Furthermore, because the small-sample labeled data consists of manually labeled sentence patterns, after a large amount of corpora is obtained from the semantic knowledge graph, those labeled sentence patterns can serve as templates for automatic labeling to produce the large-sample labeled data, which reduces the cost of manually labeling massive data, increases labeling speed, and improves the reliability of the labeled data. In addition, training the semantic recognition model on the expanded large-sample labeled data allows the model to learn from more interaction corpora, improving its training effect. Therefore, in a human-machine voice dialogue scenario, after the artificial intelligence agent receives target voice information, it recognizes and analyzes the information with the trained semantic recognition model, accurately determining the semantic information and improving recognition accuracy. Further, based on accurate semantic information and the target voice information, the artificial intelligence agent can output response information that better matches interaction expectations, improving the human-computer interaction effect.
Based on the above embodiment, the present application further provides a semantic knowledge graph construction method, so that a semantic knowledge graph containing corpora from various human-computer interaction scenarios can be constructed, providing more labeled sample data for training the semantic recognition model.
The concept of the Knowledge Graph (KG) was formally proposed by Google in 2012, initially to improve the performance of its search engine. In essence, a knowledge graph is a semantic network that expresses various types of entities and the semantic relationships between them. In other words, a knowledge graph is a directed graph with different types of entities as nodes and the various relationships between entities as edges.
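The directed-graph view just described can be made concrete with a minimal data structure; entity names and the relation label here are placeholders.

```python
# Minimal directed-graph form of a knowledge graph: typed entities as
# nodes, labeled relations as edges.
class KnowledgeGraph:
    def __init__(self):
        self.nodes = {}   # entity name -> attribute dict
        self.edges = []   # (head, relation, tail) triples

    def add_entity(self, name, **attrs):
        self.nodes[name] = attrs

    def add_relation(self, head, relation, tail):
        self.edges.append((head, relation, tail))

    def neighbors(self, entity, relation=None):
        """Tails reachable from `entity`, optionally along one relation type."""
        return [t for h, r, t in self.edges
                if h == entity and (relation is None or r == relation)]

kg = KnowledgeGraph()
kg.add_entity("Director L", type="person")
kg.add_entity("Movie A", type="movie")
kg.add_relation("Director L", "directed", "Movie A")
```

Storing edges as (head, relation, tail) triples matches the triple information extracted in step 320 below, so the extracted triples can be loaded into such a structure directly.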
In one embodiment, as shown in fig. 3, the semantic knowledge graph building method provided by the present application is described by taking the method as an example of applying the method to the artificial intelligence body in fig. 1, and includes the following steps:
step 310: and obtaining corpora in various human-computer interaction scenes.
The corpus is voice texts which are output to the artificial intelligence body by different users when different interaction intents are expressed in different human-computer interaction scenes. It should be understood that, in the same scene and with the same semantic information, the different language expression modes and/or expression habits of different users may cause the phonetic text to be different.
As an example, in a human-computer interaction scenario where music is played through an artificial intelligence body, the semantic information is to play a piece of music. The corpus obtained may include: hello, play work A for me to listen; randomly play a song of Zhou X for listening; play work A of Zhou X.
As another example, in a human-computer interaction scenario where movie playback is controlled through an artificial intelligence body, the semantic information is to play a movie of Li X, and the obtained corpus may include: play "Dream Space" directed by Li X; play a film starring the actor Li X for watching; play the xx movie of Li X.
In one possible implementation manner, the implementation process of step 310 may be: voice data is obtained from various human-computer interaction scenes by web crawling, and the voice texts output to the artificial intelligence body by users are then preprocessed (e.g., deduplication, cleaning, conversion, sorting and classification) to obtain the corpora in the various human-computer interaction scenes.
Step 320: extracting corresponding triple information from the linguistic data in each man-machine interaction scene; the triple information includes entities, relationships and attributes.
In one possible implementation manner, the implementation procedure of step 320 may be: performing information extraction on the corpora in each human-computer interaction scene to obtain the corresponding structured information such as entities, relations and entity attributes. Further, through knowledge representation, the various types of knowledge extracted from the real world are expressed as structured language text that a computer can store and compute, yielding the triple information.
The information extraction comprises entity extraction, relation extraction and attribute extraction. Entity extraction, also called Named Entity Recognition (NER), automatically recognizes named entities from voice texts; the main methods comprise rule-based methods, statistical machine learning methods, open-domain-oriented information extraction, and the like. However, entity extraction yields only a series of discrete named entities from the voice text; to obtain semantic information, the relations between the entities must also be analyzed from the voice text, and the entities linked by these relations into a networked knowledge structure. The main methods of relation extraction include manual construction of grammatical and semantic rules, statistical machine learning, and open-domain-oriented relation extraction techniques. Further, the purpose of attribute extraction is to collect attribute information of a specific entity from different information sources; for example, for a certain public figure, information such as nickname, birthday, nationality and educational background can be obtained from public network information. Attribute extraction technology can collect this information from multiple data sources to achieve a complete sketch of the entity's attributes.
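As a minimal illustration of the rule-based branch of entity extraction mentioned above, the following sketch matches a hand-built entity dictionary against a voice text; the dictionary entries, attribute labels, and sample sentence are illustrative stand-ins rather than data from this application:

```python
import re

# Toy rule-based entity extraction: a dictionary maps known entity strings
# to their attribute types, and every occurrence in the text is reported.
ENTITY_DICT = {"Wang XX": "singer", "work A": "music"}

def extract_entities(text):
    """Return (entity, attribute, start position) for each dictionary hit,
    sorted by position of appearance in the voice text."""
    hits = []
    for entity, attr in ENTITY_DICT.items():
        for m in re.finditer(re.escape(entity), text):
            hits.append((entity, attr, m.start()))
    return sorted(hits, key=lambda h: h[2])

print(extract_entities("Please play work A of Wang XX"))
# [('work A', 'music', 12), ('Wang XX', 'singer', 22)]
```

Statistical or open-domain approaches would replace the dictionary with a learned model, but the interface — text in, typed entity spans out — stays the same.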
Further, the knowledge representation typically uses Resource Description Framework (RDF) triples in Subject-Predicate-Object (SPO) form to symbolically describe the relationships between entities as a graph.
Specifically, the triples may be represented in the (entity, attribute, attribute value) manner, or in the (entity, relation, entity) manner, which is not limited in this embodiment of the present application.
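A short sketch of the two triple layouts just described; the entity and relation names are illustrative, and the helper simply treats a triple whose third element is itself a known entity as a relation triple:

```python
# (entity, relation, entity) links two entities; (entity, attribute,
# attribute value) attaches a literal value to a single entity.
relation_triple = ("singer A", "singing", "music 1")
attribute_triple = ("singer A", "nationality", "China")

def is_relation_triple(triple, entities):
    """Distinguish the two layouts by whether the object is a known entity."""
    return triple[2] in entities

entities = {"singer A", "music 1"}
print(is_relation_triple(relation_triple, entities))   # True
print(is_relation_triple(attribute_triple, entities))  # False
```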
Step 330: and constructing a semantic knowledge graph according to the triple information.
The constructed semantic knowledge graph can be a knowledge graph meeting a single human-computer interaction scene or a comprehensive knowledge graph meeting various human-computer interaction scenes, and the semantic knowledge graph is not limited in the embodiment of the application.
In addition, when the semantic knowledge graph is constructed, a mode (Schema) of the semantic knowledge graph needs to be set first to limit the format of the corpus to be added into the semantic knowledge graph. In other words, a corpus model in each human-computer interaction scene is determined, wherein the corpus model comprises meaningful entity types in the scene and entity attributes corresponding to the entity types.
Therefore, setting the schema normalizes the structured expression of the corpora in the semantic knowledge graph: a corpus entry is allowed to be updated into the semantic knowledge graph only when it conforms to the entity objects and types defined in advance in the schema.
It should be noted that, by executing step 320, knowledge elements such as entities, relations and attributes can be extracted from the original corpora, but these results may contain a large amount of redundant and erroneous information, and the relations between the data are flattened, lacking hierarchy and logic.
Therefore, in one possible implementation manner, the implementation procedure of step 330 may be: cleaning and integrating the entities, relations and attributes in the triple information through knowledge fusion, and eliminating the ambiguity between entity mentions and entity objects, so as to obtain a series of basic fact expressions. Further, knowledge processing, including ontology construction, knowledge reasoning, quality evaluation and the like, is performed on the fused basic facts to obtain the structured, networked semantic knowledge graph.
As an example, as shown in fig. 4, for the music field, the triple information composed of information related to singer A includes: (singer A, identity, singer), (singer A, identity, producer), (singer A, identity, director), (singer A, place of birth, Taiwan), (singer A, ethnicity, Han), (singer A, nationality, China), (singer A, singing, music 1), (singer A, singing, music 2), (singer A, singing, music 3), and (singer A, singing, music 4). Further, since the wife of singer A is actor E, there is also the triple (actor E, couple, singer A); and since music 4 was also performed by singer B and singer C, there are also the triples (singer B, singing, music 4) and (singer C, singing, music 4).
It should be understood that fig. 4 is only a semantic knowledge graph corresponding to the triplet information of the above example, and is a schematic diagram of a local semantic knowledge graph in a human-computer interaction scene of playing music through an artificial intelligence, and the semantic knowledge graph may further include other triplet information.
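Triples of the kind shown in fig. 4 can be stored directly as a labeled directed graph, with entities as nodes and relations or attributes as edge labels. The sketch below, using a mocked-up subset of the example triples, is one minimal way to do this:

```python
from collections import defaultdict

# Store triples as adjacency lists: node -> [(edge label, neighbor), ...].
triples = [
    ("singer A", "identity", "singer"),
    ("singer A", "place of birth", "Taiwan"),
    ("singer A", "singing", "music 4"),
    ("singer B", "singing", "music 4"),
    ("actor E", "couple", "singer A"),
]

graph = defaultdict(list)
for subj, pred, obj in triples:
    graph[subj].append((pred, obj))

def neighbors(graph, entity, relation):
    """All objects linked to `entity` by `relation` in the directed graph."""
    return [obj for pred, obj in graph[entity] if pred == relation]

print(neighbors(graph, "singer A", "singing"))  # ['music 4']
```

A production system would typically use a dedicated graph store, but the node/edge structure is the same.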
In this embodiment, a plurality of triple information is obtained by obtaining corpora in a plurality of human-computer interaction scenes and extracting entities, relationships and attributes from the corpora. And further, constructing a semantic knowledge graph according to the triple information. Therefore, the constructed semantic knowledge graph contains the entities in the voice texts in various human-computer interaction scenes and the relations among the entities, and data support can be provided for learning of the semantic recognition model.
In addition, in order to achieve the desired recognition accuracy, the semantic recognition model needs to be trained with a very large amount of labeled data. Therefore, the present application obtains massive large sample labeling data by expanding the limited small sample labeling data, providing data support for training the semantic recognition model. In one possible implementation manner, the small sample labeling data can be expanded through the constructed semantic knowledge graph to obtain the large sample labeling data, thereby achieving labeled-data enhancement. The data expansion processing comprises data transformation processing and data addition and deletion processing.
It should be noted that data enhancement, as a data preprocessing method, is widely applied in the field of computer vision, for example by rotating, cropping, flipping and translating image samples; using data enhancement can effectively improve the generalization capability of a model and reduce the demand for labeled data. In the field of natural language processing, however, there are fewer data enhancement methods, and different data enhancement methods need to be designed for different tasks.
In one embodiment, as shown in fig. 5, the present application provides a data enhancement method, which is described by taking the method as an example for the artificial intelligence in fig. 1, and includes the following steps:
step 510: and acquiring small sample labeling data in various human-computer interaction scenes.
Each piece of labeled data in the small sample labeled data comprises a piece of voice text and corresponding semantic information.
It should be noted that, in the same interaction scenario, semantic information in the human-computer interaction process may be the same, but expression modes and/or expression habits of different users are different, and for the same semantic, speech texts that are delivered to the artificial intelligence by different users are different.
The small sample labeling data may be sample data that is manually labeled and stored in a database after the artificial intelligence body collects voice texts in various human-computer interaction scenes, or already-labeled sample data acquired directly from other third-party systems or databases; the source of the small sample labeling data is not limited by this embodiment of the application.
Step 520: and carrying out data transformation processing on the small sample labeling data to obtain expansion sample data.
In one possible implementation manner, the implementation process of step 520 may be: determining a target entity in each small sample marking data and entity attributes corresponding to the target entity; acquiring entities of the same type of the target entity from the semantic knowledge graph according to the entity attribute corresponding to the target entity; and replacing the target entity in the small sample labeling data with the same type entity to obtain the expansion sample data.
The target entity comprises at least one entity, and the same type entity comprises a plurality of entities. In other words, based on one target entity, a plurality of entities of the same type can be obtained from the semantic knowledge graph, and the entity attributes of the target entity and the entities of the same type are the same.
As an example, as shown in fig. 6, the voice text in the obtained annotation data is: play the A work of Wang XX; its semantic information is: play music. The first target entity in the annotation data is "Wang XX", and the second target entity is "A work". Moreover, the attribute of the entity "Wang XX" is singer, the attribute of the entity "A work" is music, and a singing relationship exists between the first target entity and the second target entity.
Based on the first target entity, the same-type entities with the attribute "singer" obtained from the semantic knowledge graph include but are not limited to: Zhou XX, Zhang X, Cai XX and Lin XX. Further, for each same-type entity of the first target entity, the same-type entities with the attribute "music" obtained from the knowledge graph include but are not limited to: work B, work C, work D and work E.
Further, in a possible implementation manner, the process of replacing the target entity in the small sample labeling data with the same type of entity may be: acquiring position information of a target entity in the small sample marking data; and replacing the target entity in the small sample labeling data with the same type entity according to the position information.
Referring to fig. 6, "Wang XX" in the voice text is replaced in turn by: Zhou XX, Zhang X, Cai XX and Lin XX; and "A work" is replaced in turn by: work B, work C, work D and work E.
It should be noted that, during replacement, the singing relationship existing between the first target entity and the second target entity should be kept unchanged. For example, in the above example, when the first target entity is replaced by "Zhou XX", the second target entity can only be replaced by work B, so as to ensure the accuracy of the large sample annotation data generated after replacement.
Based on the above example, after the small sample annotation data "play the A work of Wang XX" is subjected to data transformation processing, the obtained expansion sample data includes: play the B work of Zhou XX; play the C work of Zhang X; play the D work of Cai XX; and play the E work of Lin XX.
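The position-based replacement described above can be sketched as follows; the utterance and entity names mirror the running example, and keeping the (singer, music) pair consistent with the singing relationship is left to the caller, as the text requires:

```python
def replace_entity(text, target, replacement):
    """Find `target` in the labeled voice text and substitute a same-type
    entity at that position, leaving the rest of the sentence intact."""
    pos = text.find(target)
    if pos == -1:
        return text
    return text[:pos] + replacement + text[pos + len(target):]

# Both target entities are replaced, preserving the singing relationship
# (Zhou XX sings work B in the example knowledge graph).
sample = "Play the A work of Wang XX"
expanded = replace_entity(sample, "Wang XX", "Zhou XX")
expanded = replace_entity(expanded, "A work", "B work")
print(expanded)  # Play the B work of Zhou XX
```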
In another possible implementation manner, the process of replacing the target entity in the small sample annotation data with the same type of entity may be: and classifying the small sample labeled data according to semantic information corresponding to the voice text in the small sample labeled data to obtain multiple types of labeled data, wherein the semantic information of each type of labeled data is the same. Then, for a plurality of voice texts with the same semantic information, determining a target entity and attributes of the target entity included in each voice text, vacating the position of the target entity, and marking the attributes of the entity to be filled at the vacated position, so as to obtain a plurality of sentence templates under the voice text. Furthermore, according to the attributes of the positions of the fillable entities marked in each sentence pattern template and the entity relationship among the fillable entities, a plurality of entities of the same type meeting the requirements are obtained from the semantic knowledge graph, and the entities of the same type are filled in each sentence pattern template in sequence, so that the expansion sample data can be obtained.
As an example, assume that semantic information in the small sample annotation data is: playing music, wherein the labeled voice texts under the semantic information comprise the following parts:
(1) Please play work B of Zhou XX, okay?
(2) I want to hear work C, the one by Zhang X.
(3) Play a random song from album Y of Lin XX.
The target entities in voice text (1) are: Zhou XX and work B, whose attributes are respectively: singer and music. The first sentence pattern template thus obtained is: Please play [music] of [singer], okay?
Similarly, the target entities in voice text (2) are: work C and Zhang X, whose attributes are respectively: music and singer. The second sentence pattern template thus obtained is: I want to hear [music], the one by [singer]. The target entities in voice text (3) are: Lin XX and album Y, whose attributes are respectively: singer and album. The third sentence pattern template thus obtained is: Play a random song from [album] of [singer].
Further, as shown in fig. 7, according to the entity attributes of the target entities in the first sentence pattern template and the relationship between them, the first same-type entities with the attribute "singer" obtained from the semantic knowledge graph may be: Wang XX, Zhang X, Cai XX and Lin XX. Then, according to the singing relationship between singer and music, the songs sung by Wang XX, Zhang X, Lin XX and Cai XX are obtained, so that the second same-type entities with the attribute "music" may be: work A, work C, work D and work E. Finally, the first same-type entities are filled in turn into the position marked with the attribute "singer" in the first sentence pattern template, and the second same-type entities are filled in turn into the position marked with the attribute "music", so as to obtain the expansion sample data.
Thus, the expansion sample data obtained through the first sentence pattern template includes: Please play work C of Zhang X, okay? Please play work E of Lin XX, okay? Please play work D of Cai XX, okay? Please play work A of Wang XX, okay?
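The template-filling variant above can be sketched in a few lines; the sentence pattern templates and the (singer, music) pairs are illustrative, with the singing relationship from the knowledge graph keeping each pair consistent:

```python
# Mocked singing relation from the semantic knowledge graph.
sings = {"Wang XX": "work A", "Zhang X": "work C",
         "Cai XX": "work D", "Lin XX": "work E"}

# Sentence pattern templates with vacated, attribute-labeled slots.
templates = [
    "Please play {music} of {singer}, okay?",
    "I want to hear {music}, the one by {singer}.",
]

expanded = [t.format(singer=s, music=m)
            for t in templates
            for s, m in sings.items()]
print(len(expanded))  # 8
print(expanded[0])    # Please play work A of Wang XX, okay?
```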
Step 530: and performing data addition and deletion processing on the expansion sample data to obtain large sample labeling data.
For the extended sample data, the adding and deleting process includes adding text content to the voice text and deleting the text content of the voice text.
In one possible implementation manner, the implementation process of step 530 may be: based on a preset prefix and suffix database, randomly adding a prefix and/or a suffix to the expansion sample data to obtain first expansion data; and performing language word adding and deleting operation on the first expansion data to obtain large sample labeling data.
The prefix and suffix database comprises a prefix library and a suffix library. The prefix library comprises words that can be added at the head of the voice text, such as: handsome boy, beautiful girl, hello, excuse me. The suffix library comprises words that can be added at the end of the voice text, such as: okay, can you, all right, may I ask, please.
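A hedged sketch of the add/delete step: a random prefix/suffix draw followed by a filler-word insertion. The word lists below are illustrative stand-ins for the preset databases, not the application's actual libraries:

```python
import random

PREFIXES = ["Hello, ", "Excuse me, ", ""]   # head-end additions
SUFFIXES = ["", ", okay?", ", thanks"]      # tail-end additions
FILLERS = ["um", "uh", "well"]              # meaningless filler words

def add_affixes(text, rng):
    """Randomly prepend a prefix and append a suffix from the databases."""
    return rng.choice(PREFIXES) + text + rng.choice(SUFFIXES)

def insert_filler(text, rng):
    """Insert one filler word at a random position between tokens."""
    words = text.split()
    words.insert(rng.randrange(len(words) + 1), rng.choice(FILLERS))
    return " ".join(words)

rng = random.Random(0)
print(insert_filler(add_affixes("play work A of Wang XX", rng), rng))
```

Each pass over the expansion sample data yields a new randomized variant, so one sentence can fan out into many labeled samples.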
As one example, the filler words in the voice text include: um, uh, oh, er, and the like, which do not affect the content but are meaningless words reflecting a person's expression habits and manner.
In the embodiment of the application, based on the constructed semantic knowledge graph, the large sample labeling data is obtained by performing data transformation processing and data addition and deletion processing on the small sample labeling data. Therefore, massive large sample labeling data can be obtained automatically from the already-labeled small sample labeling data and the semantic knowledge graph, reducing the cost of manual labeling and improving the speed and accuracy of data labeling.
Based on the embodiment shown in fig. 5, as shown in fig. 8, the present application further provides another data enhancement method, which is described by taking the application of the method to the artificial intelligence agent in fig. 1 as an example, and includes the following steps:
step 810: acquiring small sample labeling data in each man-machine interaction scene;
step 820: determining a target entity in each small sample marking data and entity attributes corresponding to the target entity; the target entity comprises at least one entity;
step 830: acquiring entities of the same type of the target entity from the semantic knowledge graph according to the entity attribute corresponding to the target entity;
step 840: acquiring the position information of a target entity in the small sample labeling data;
step 850: replacing the target entity in the small sample marking data with the same type entity according to the position information to obtain the expansion sample data;
step 860: based on a preset prefix and suffix database, randomly adding a prefix and/or a suffix to the expansion sample data to obtain first expansion data;
step 870: and performing language word addition and deletion operation on the first expansion data to obtain large sample labeling data.
The implementation principle and technical effect of each step in the data enhancement method provided in this embodiment are similar to those of the method embodiment shown in fig. 5, and are not described herein again.
Based on the data enhancement method shown in the above embodiment, the small sample labeling data is expanded to obtain a large amount of large sample labeling data. Further, the initial semantic recognition model is trained by adopting large sample labeling data, and the semantic recognition model of the application is obtained. Therefore, after the semantic recognition model is obtained, the artificial intelligence body can apply the semantic recognition model to accurately recognize and respond the target voice information.
In one embodiment, as shown in fig. 9, the determining the implementation process of the response information according to the target speech information and the semantic information in step 220 includes the following steps:
step 910: and determining the reference entity in the target voice information and the attribute of the reference entity.
The reference entity is at least one entity in the target voice information, and the attribute is a vocabulary for describing the characteristics of the entity from different dimensions.
As an example, if the target voice information is: play Zhou Jielun's "Bu Chao Lai" for listening, the artificial intelligence body identifies reference entities from it based on the semantic recognition model, including: "Zhou Jielun" and "Bu Chao Lai", whose attributes are "singer" and "music".
Step 920: and acquiring at least one candidate entity corresponding to the reference entity from the semantic knowledge graph according to the attribute of the reference entity.
It should be noted that the candidate entity obtained from the semantic knowledge graph may be the same as or different from the reference entity; this embodiment only requires that the attributes of the reference entity and the candidate entity be the same, not that their content and text be identical.
Step 930: and determining the entity with the highest similarity with the reference entity in the at least one candidate entity as the standard entity.
In this step, if the candidate entity is the same as the reference entity, the reference entity is taken as the standard entity; and if the candidate entity is different from the reference entity, determining the candidate entity with the highest similarity as the standard entity according to the similarity between the candidate entity and the reference entity.
That is, in the embodiment of the present application, the semantic knowledge graph is used as a standard, the error correction is performed on the recognition result of the semantic recognition model in the artificial intelligence body through the candidate entity in the semantic knowledge graph, and when the two are different, the candidate entity with the highest similarity in the semantic knowledge graph is used as a standard entity, and the reference entity recognized by the semantic recognition model is ignored.
Based on the above example, suppose the target voice information is: play yyy of Zhou XX for listening, and the recognized reference entity is "yyy from" with the attribute "music". However, when the music sung by Zhou XX is queried from the semantic knowledge graph, the obtained candidate entity is "yyy", so "yyy" is finally determined as the standard entity.
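The correction step can be sketched with a generic string-similarity measure; `difflib.SequenceMatcher` here stands in for whatever similarity the application actually uses, and the candidate list is illustrative:

```python
import difflib

def standard_entity(reference, candidates):
    """Pick the candidate entity most similar to the recognized reference
    entity; it becomes the standard entity used for the response."""
    return max(candidates,
               key=lambda c: difflib.SequenceMatcher(None, reference, c).ratio())

candidates = ["yyy", "zzz song", "another work"]   # same-attribute entities
print(standard_entity("yyy from", candidates))     # yyy
```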
Step 940: and determining response information according to the semantic information of the standard entity and the target voice information.
Based on the above example, if the target voice information is: play yyy of Zhou XX for listening, and the semantic information is: play music, the response information determined by the artificial intelligence body may be: OK, now playing song yyy of Zhou XX for you.
In this embodiment, the artificial intelligence body recognizes the target voice information through the semantic recognition model, and determines the reference entity, the attributes of the reference entity, and the semantic information of the target voice information. Then, at least one candidate entity is acquired from the semantic knowledge graph according to the entity attribute of the reference entity, and the candidate entities are used to correct the reference entity recognized by the artificial intelligence body, so as to determine a standard entity from the reference entity and the candidate entities. Furthermore, according to the semantic information of the standard entity and the target voice information, response information that better matches the user's interaction expectation is determined, so that the human-computer interaction effect is better.
Based on the embodiment, the recognition result of the semantic recognition model needs to be corrected by using the semantic knowledge graph so as to improve the recognition accuracy of the semantic recognition model. Therefore, with the appearance of new corpora, the constructed semantic knowledge graph needs to be updated so as to provide more reference information for the semantic recognition model.
In one embodiment, the implementation process of updating the semantic knowledge graph may be: and updating the semantic knowledge graph according to the linguistic data in each man-machine interaction scene at the current time according to a preset period.
In one possible implementation mode, the corpora in each human-computer interaction scene are acquired periodically, and the semantic knowledge graph is updated fully or incrementally according to the triple information in the corpora at the current moment, so that the semantic knowledge graph supplements more entities on its original basis and enriches the relations among the entities.
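One minimal way to realize the incremental case, assuming the graph is held as a triple list: only triples unseen in the existing graph are appended each period:

```python
def incremental_update(graph_triples, new_triples):
    """Merge this period's crawled triples into the graph, skipping duplicates."""
    existing = set(graph_triples)
    added = [t for t in new_triples if t not in existing]
    graph_triples.extend(added)
    return added

graph = [("singer A", "singing", "music 1")]
added = incremental_update(graph, [("singer A", "singing", "music 1"),
                                   ("singer A", "singing", "music 5")])
print(len(graph), len(added))  # 2 1
```

A full update would instead rebuild the triple list from scratch; the incremental form is cheaper when only a small share of the corpus is new.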
Optionally, after the semantic knowledge graph is updated, the updated semantic knowledge graph can be used to further expand the original large sample labeling data, and the semantic recognition model can be retrained with the expanded labeling data.
In the embodiment of the application, by updating the semantic knowledge graph, on the one hand, more training samples can be provided for the semantic recognition model. On the other hand, the entity recognition results of the semantic recognition model are corrected with the entities in the updated semantic knowledge graph, so that updating the semantic knowledge graph can ensure the recognition accuracy of the semantic recognition model while reducing the number of times the model must be retrained.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to the illustrated order, and they may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In addition, an embodiment of the present application further provides a speech information recognition method, and an implementation scheme for solving the problem provided by the method is similar to an implementation process of performing speech information recognition by the artificial intelligence, so specific limitations in one or more embodiments of the speech information recognition method provided below may refer to limitations of implementation steps when a processor in the artificial intelligence invokes and executes computer-readable instructions, which is not described herein again.
It should be noted that the speech information recognition method provided by the present application may be executed by an artificial intelligence agent, and may also be implemented by other computer devices, and the present application does not limit the execution subject, and any computer device that performs speech recognition according to the speech information recognition method shown in the present application, or implements human-computer speech dialogue interaction is within the scope of the present application.
The voice information recognition method comprises the following steps:
responding to the target voice information, and identifying semantic information of the target voice information through a preset semantic identification model;
the semantic identification model is obtained by training large sample labeling data, and the large sample labeling data is obtained by performing data expansion processing on small sample labeling data through a pre-constructed semantic knowledge graph; the semantic knowledge graph comprises corpora in various human-computer interaction scenes;
and determining response information according to the target voice information and the semantic information, and outputting the response information.
In one embodiment, the method further comprises:
obtaining corpora in various human-computer interaction scenes;
extracting corresponding triple information from the linguistic data in each man-machine interaction scene; the triple information comprises entities, relations and attributes;
and constructing a semantic knowledge graph according to the triple information.
In one embodiment, the data augmentation process comprises: data transformation processing and data addition and deletion processing; the method further comprises the following steps:
acquiring small sample labeling data in each man-machine interaction scene;
carrying out data transformation processing on the small sample labeling data to obtain expansion sample data;
and carrying out data addition and deletion processing on the expansion sample data to obtain large sample labeling data.
In one embodiment, the data transformation processing is performed on the small sample labeled data to obtain extended sample data, and the method includes:
determining a target entity in each small sample marking data and entity attributes corresponding to the target entity; the target entity comprises at least one entity;
acquiring entities of the same type of the target entity from the semantic knowledge graph according to the entity attribute corresponding to the target entity;
and replacing the target entity in the small sample labeling data with the same type entity to obtain the expansion sample data.
In one embodiment, the target entity in the small sample labeling data is replaced by the same type of entity, including:
acquiring position information of a target entity in the small sample marking data;
and replacing the target entity in the small sample labeling data with the same type entity according to the position information.
In one embodiment, the data adding and deleting processing is performed on the expansion sample data to obtain large sample labeling data, and the method includes:
based on a preset prefix and suffix database, randomly adding a prefix and/or a suffix to the expansion sample data to obtain first expansion data;
and performing language word addition and deletion operation on the first expansion data to obtain large sample labeling data.
In one embodiment, determining the response information according to the target voice information and the semantic information comprises:
determining attributes of a reference entity and a reference entity in target voice information;
acquiring at least one candidate entity corresponding to the reference entity from the semantic knowledge graph according to the attribute of the reference entity;
determining an entity with the highest similarity with the reference entity in at least one candidate entity as a standard entity;
and determining the response information according to the semantic information of the standard entity and the target voice information.
In one embodiment, the method further comprises:
and updating the semantic knowledge graph according to the linguistic data in each man-machine interaction scene at the current moment according to a preset period.
The implementation principle and technical effect of the voice information recognition method provided by this embodiment are similar to those of the steps executed by the artificial intelligence agent, and are not repeated here.
Based on the same inventive concept, an embodiment of the present application further provides a voice information recognition apparatus for implementing the above voice information recognition method. The solution provided by the apparatus is similar to that described in the above method; therefore, for the specific limitations in the one or more embodiments of the voice information recognition apparatus provided below, reference may be made to the limitations of the voice information recognition method above, and details are not repeated here.
In one embodiment, as shown in fig. 10, a voice information recognition apparatus is provided. The apparatus 1000 comprises a semantic recognition module 1010 and a response module 1020, wherein:
the semantic recognition module 1010 is configured to recognize, in response to target voice information, semantic information of the target voice information through a preset semantic recognition model;
the semantic recognition model is trained on large-sample labeled data, and the large-sample labeled data are obtained by performing data expansion processing on small-sample labeled data through a pre-constructed semantic knowledge graph; the semantic knowledge graph comprises corpora in various human-computer interaction scenes;
and the response module 1020 is configured to determine response information according to the target voice information and the semantic information, and output the response information.
In one embodiment, the apparatus 1000 further comprises:
the first acquisition module is configured to acquire corpora in various human-computer interaction scenes;
the extraction module is configured to extract corresponding triple information from the corpora in each human-computer interaction scene, the triple information comprising entities, relations, and attributes;
and the construction module is configured to construct the semantic knowledge graph according to the triple information.
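The extraction and construction modules can be pictured as follows: the graph is, at bottom, an index of (entity, relation/attribute, value) triples by head entity. The triples shown are hypothetical examples.

```python
def build_semantic_knowledge_graph(triples):
    """Index extracted triples by head entity so that the attributes and
    relations of an entity can be looked up during data expansion and
    response generation."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, {}).setdefault(relation, []).append(tail)
    return graph

kg = build_semantic_knowledge_graph([
    ("fried rice", "type", "dish"),
    ("fried rice", "served_at", "table 3"),
    ("table 3", "type", "location"),
])
```

With this shape, "entities of the same type" are simply the heads sharing a `type` value, and a head entity's other relations supply material for responses.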
In one embodiment, the data expansion processing comprises data transformation processing and data addition and deletion processing, and the apparatus 1000 further comprises:
the second acquisition module, configured to acquire small-sample labeled data in each human-computer interaction scene;
the data transformation module, configured to perform the data transformation processing on the small-sample labeled data to obtain expanded sample data;
and the data addition and deletion module, configured to perform the data addition and deletion processing on the expanded sample data to obtain the large-sample labeled data.
In one embodiment, the data transformation module comprises:
the first determining unit is configured to determine a target entity in the small-sample labeled data and an entity attribute corresponding to the target entity, wherein the target entity comprises at least one entity;
the first acquisition unit is configured to acquire entities of the same type as the target entity from the semantic knowledge graph according to the entity attribute corresponding to the target entity;
and the replacing unit is configured to replace the target entity in the small-sample labeled data with the same-type entities to obtain the expanded sample data.
In one embodiment, the replacement unit includes:
the acquisition subunit is configured to acquire position information of the target entity in the small-sample labeled data;
and the replacing subunit is configured to replace the target entity in the small-sample labeled data with the same-type entity according to the position information.
In one embodiment, the data addition and deletion module comprises:
the adding unit, configured to randomly add a prefix and/or a suffix to the expanded sample data based on a preset prefix-and-suffix database, to obtain first expanded data;
and the addition and deletion unit, configured to perform a modal-particle addition and deletion operation on the first expanded data to obtain the large-sample labeled data.
In one embodiment, the response module 1020 includes:
the second determining unit is configured to determine a reference entity in the target voice information and an attribute of the reference entity;
the second acquisition unit is configured to acquire at least one candidate entity corresponding to the reference entity from the semantic knowledge graph according to the attribute of the reference entity;
the third determining unit is configured to determine, among the at least one candidate entity, the entity with the highest similarity to the reference entity as a standard entity;
and the fourth determining unit is configured to determine the response information according to the standard entity and the semantic information of the target voice information.
In one embodiment, the apparatus 1000 further comprises:
and the updating module, configured to update, at a preset period, the semantic knowledge graph according to the corpora in each human-computer interaction scene at the current moment.
The modules in the above voice information recognition apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded, in hardware form, in or independent of a processor in a computer device, or stored, in software form, in a memory in the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, an artificial intelligence agent is provided, which may be any intelligent agent or intelligent terminal; its internal structure may be as shown in FIG. 11. The artificial intelligence agent comprises a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the artificial intelligence agent is configured to provide computing and control capabilities. The memory of the artificial intelligence agent comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The communication interface of the artificial intelligence agent is used for wired or wireless communication with an external terminal; the wireless communication may be implemented through Wi-Fi, an operator network, near field communication (NFC), or other technologies. The computer program, when executed by the processor, implements a voice information recognition method. The display screen of the artificial intelligence agent may be a liquid crystal display screen or an electronic ink display screen, and the input device may be a touch layer covering the display screen, a key, a trackball, or a touchpad arranged on the housing of the artificial intelligence agent, or an external keyboard, touchpad, mouse, or the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 11 is a block diagram of only a portion of the structure related to the solution of the present application and does not limit the artificial intelligence agent to which the solution is applied; a specific artificial intelligence agent may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
recognizing, in response to target voice information, semantic information of the target voice information through a preset semantic recognition model;
wherein the semantic recognition model is trained on large-sample labeled data, the large-sample labeled data are obtained by performing data expansion processing on small-sample labeled data through a pre-constructed semantic knowledge graph, and the semantic knowledge graph comprises corpora in various human-computer interaction scenes;
and determining response information according to the target voice information and the semantic information, and outputting the response information.
The implementation principle and technical effect of the computer device provided by the above embodiment are similar to those of the above method embodiment, and are not described herein again.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
recognizing, in response to target voice information, semantic information of the target voice information through a preset semantic recognition model;
wherein the semantic recognition model is trained on large-sample labeled data, the large-sample labeled data are obtained by performing data expansion processing on small-sample labeled data through a pre-constructed semantic knowledge graph, and the semantic knowledge graph comprises corpora in various human-computer interaction scenes;
and determining response information according to the target voice information and the semantic information, and outputting the response information.
The implementation principle and technical effect of the computer-readable storage medium provided by the above embodiments are similar to those of the above method embodiments, and are not described herein again.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:
recognizing, in response to target voice information, semantic information of the target voice information through a preset semantic recognition model;
wherein the semantic recognition model is trained on large-sample labeled data, the large-sample labeled data are obtained by performing data expansion processing on small-sample labeled data through a pre-constructed semantic knowledge graph, and the semantic knowledge graph comprises corpora in various human-computer interaction scenes;
and determining response information according to the target voice information and the semantic information, and outputting the response information.
The computer program product provided by the above embodiment has implementation principles and technical effects similar to those of the above method embodiments, and is not described here again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (11)

1. An artificial intelligence agent, comprising a memory and a processor, wherein the memory has a computer program stored therein, and the processor is configured to implement the following steps when invoking and executing the computer program:
recognizing, in response to target voice information, semantic information of the target voice information through a preset semantic recognition model;
wherein the semantic recognition model is trained on large-sample labeled data, the large-sample labeled data are obtained by performing data expansion processing on small-sample labeled data through a pre-constructed semantic knowledge graph, and the semantic knowledge graph comprises corpora in various human-computer interaction scenes;
and determining response information according to the target voice information and the semantic information, and outputting the response information.
2. The artificial intelligence agent of claim 1, wherein the processor is further configured to implement the following steps when invoking and executing the computer program:
acquiring corpora in various human-computer interaction scenes;
extracting corresponding triple information from the corpora in each human-computer interaction scene, wherein the triple information comprises entities, relations, and attributes;
and constructing the semantic knowledge graph according to the triple information.
3. The artificial intelligence agent of claim 1 or 2, wherein the data expansion processing comprises data transformation processing and data addition and deletion processing;
the processor is further configured to implement the following steps when invoking and executing the computer program:
acquiring small-sample labeled data in each human-computer interaction scene;
performing the data transformation processing on the small-sample labeled data to obtain expanded sample data;
and performing the data addition and deletion processing on the expanded sample data to obtain the large-sample labeled data.
4. The artificial intelligence agent of claim 3, wherein performing the data transformation processing on the small-sample labeled data to obtain the expanded sample data comprises:
determining a target entity in the small-sample labeled data and an entity attribute corresponding to the target entity, wherein the target entity comprises at least one entity;
acquiring entities of the same type as the target entity from the semantic knowledge graph according to the entity attribute corresponding to the target entity;
and replacing the target entity in the small-sample labeled data with the same-type entities to obtain the expanded sample data.
5. The artificial intelligence agent of claim 4, wherein replacing the target entity in the small-sample labeled data with the same-type entity comprises:
acquiring position information of the target entity in the small-sample labeled data;
and replacing the target entity in the small-sample labeled data with the same-type entity according to the position information.
6. The artificial intelligence agent of claim 3, wherein performing the data addition and deletion processing on the expanded sample data to obtain the large-sample labeled data comprises:
randomly adding a prefix and/or a suffix to the expanded sample data based on a preset prefix-and-suffix database, to obtain first expanded data;
and performing a modal-particle addition and deletion operation on the first expanded data to obtain the large-sample labeled data.
7. The artificial intelligence agent of claim 1 or 2, wherein determining the response information according to the target voice information and the semantic information comprises:
determining a reference entity in the target voice information and an attribute of the reference entity;
acquiring at least one candidate entity corresponding to the reference entity from the semantic knowledge graph according to the attribute of the reference entity;
determining, among the at least one candidate entity, the entity with the highest similarity to the reference entity as a standard entity;
and determining the response information according to the standard entity and the semantic information of the target voice information.
8. The artificial intelligence agent of claim 1 or 2, wherein the processor is further configured to implement the following step when invoking and executing the computer program:
updating, at a preset period, the semantic knowledge graph according to the corpora in each human-computer interaction scene at the current moment.
9. A voice information recognition method, comprising the steps implemented by the artificial intelligence agent of any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when invoked and executed by a processor, implements the steps implemented by the artificial intelligence agent of any one of claims 1 to 8.
11. A computer program product comprising a computer program, wherein the computer program, when invoked and executed by a processor, implements the steps implemented by the artificial intelligence agent of any one of claims 1 to 8.
CN202111561200.0A 2021-12-16 2021-12-16 Artificial intelligence body, voice information recognition method, storage medium and program product Pending CN114925206A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111561200.0A CN114925206A (en) 2021-12-16 2021-12-16 Artificial intelligence body, voice information recognition method, storage medium and program product


Publications (1)

Publication Number Publication Date
CN114925206A true CN114925206A (en) 2022-08-19



Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795532A (en) * 2019-10-18 2020-02-14 珠海格力电器股份有限公司 Voice information processing method and device, intelligent terminal and storage medium
CN111625658A (en) * 2020-07-28 2020-09-04 杭州翔毅科技有限公司 Voice interaction method, device and equipment based on knowledge graph and storage medium
CN111737552A (en) * 2020-06-04 2020-10-02 中国科学院自动化研究所 Method, device and equipment for extracting training information model and acquiring knowledge graph
CN113488034A (en) * 2020-04-27 2021-10-08 海信集团有限公司 Voice information processing method, device, equipment and medium
WO2021212682A1 (en) * 2020-04-21 2021-10-28 平安国际智慧城市科技股份有限公司 Knowledge extraction method, apparatus, electronic device, and storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858819A (en) * 2023-01-29 2023-03-28 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Sample data augmentation method and device
CN115858819B (en) * 2023-01-29 2023-05-16 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Sample data amplification method and device
WO2024179044A1 (en) * 2023-02-28 2024-09-06 比亚迪股份有限公司 Semantic entity recognition method and apparatus, and speech interaction system, vehicle and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination