CN113889117A - Voice interaction system, method, device and equipment - Google Patents

Voice interaction system, method, device and equipment

Info

Publication number
CN113889117A
CN113889117A (application CN202010628897.8A)
Authority
CN
China
Prior art keywords
entity
voice
knowledge base
determining
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010628897.8A
Other languages
Chinese (zh)
Inventor
曹涌
聂再清
周晓欢
王鹏伟
谢静辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010628897.8A priority Critical patent/CN113889117A/en
Publication of CN113889117A publication Critical patent/CN113889117A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice interaction system, method, device and equipment. The voice interaction system comprises a terminal device and a server: the terminal device collects voice data and sends the voice data to the server; the server constructs an entity knowledge base, determines entity information in the voice data through a voice entity recognition model together with the entity knowledge base, and executes voice interaction processing according to the entity information. With this processing mode, entity knowledge map information is introduced, and the voice is compared directly against the entity pronunciations in the knowledge map, so that semantic understanding and entity recognition are performed from the voice signal itself, a process closer to how humans understand speech. The accuracy of voice entity recognition, and therefore of voice interaction, can thus be effectively improved.

Description

Voice interaction system, method, device and equipment
Technical Field
The application relates to the technical field of voice recognition, and in particular to: a multimedia program on-demand system, method and device; a food ordering system, method and device; a communication connection establishing system, method and device; a voice interaction system, method and device; a voice entity recognition model building method and device; an entity knowledge base building method and device; a television program on-demand method and device; a conference recording method and device; an intelligent sound box; an intelligent television; a food ordering machine; user equipment; and electronic equipment.
Background
With the continuous development of Automatic Speech Recognition (ASR) technology, intelligent speech assistants have been widely deployed, such as the intelligent speech assistant services that smart phones and smart speakers provide to users.
Accurately understanding user instructions, and in particular the entities with specific meanings that they contain, is a core capability of a voice assistant. Such entities mainly include person names, place names, organization names, song titles, movie titles, telephone numbers, proper nouns, and the like. Taking a smart speaker as an example, a user can use the song-requesting service the speaker provides, for instance by saying: "I want to listen to 'Memorial' by 'Thunderstorm Heart'" (literal translations of a Chinese song title and artist name). Here the artist name and the song title are entities with specific meanings and are the objects of the song-ordering instruction; if these two entities cannot be recognized correctly, the song requested by the user cannot be played correctly. A typical speech entity recognition system currently processes as follows: first, the input speech signal is converted into text through speech recognition (ASR); then, the entity names in the user instruction are identified through semantic understanding of that text.
However, in the process of implementing the invention, the inventors found that this technical scheme has at least the following problem: it depends heavily on the text content output by the upstream ASR. If the user's pronunciation is inaccurate (for example because of an accent or a partial mispronunciation) or unclear, the probability that ASR converts the speech signal into wrong text content is high, and the entity names in the speech signal then cannot be correctly recognized. In summary, how to improve the accuracy of speech entity recognition, and thereby the accuracy of voice interaction, is a problem urgently needing a solution from those skilled in the art.
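As an illustration only (not part of the patent), the brittleness of the cascaded ASR-then-NER pipeline can be seen in a toy sketch. The song database entry, the pinyin-style strings, and the exact-match lookup below are all hypothetical stand-ins for a real ASR front end and entity matcher:

```python
# Toy "song knowledge base" keyed by the written (transcribed) form.
SONG_DB = {"ji nian by lei yu xin"}  # hypothetical entry

def conventional_pipeline(asr_text: str) -> bool:
    """Entity lookup on the ASR transcript: an exact match against the DB."""
    return asr_text in SONG_DB

# If ASR transcribes the utterance correctly, the entity is found...
assert conventional_pipeline("ji nian by lei yu xin") is True
# ...but a single substitution by ASR breaks downstream recognition,
# even though the *pronunciation* of the utterance was essentially right.
assert conventional_pipeline("ji niang by lei yu xin") is False
```

This is the failure mode the scheme below avoids by comparing pronunciations instead of transcripts.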
Disclosure of Invention
The application provides a multimedia program on-demand system, aiming to solve the prior-art problem of low speech entity recognition accuracy caused by inaccurate pronunciation, unclear pronunciation, or homophonic entities (same pronunciation, different written characters). The application further provides: a multimedia program on-demand method and device; a food ordering system, method and device; a communication connection establishing system, method and device; a voice interaction system, method and device; a voice entity recognition model building method and device; an entity knowledge base building method and device; a television program on-demand method and device; a conference recording method and device; an intelligent sound box; an intelligent television; a food ordering machine; user equipment; and electronic equipment.
The application provides a multimedia program on demand system, including:
the intelligent sound box is used for collecting multimedia program on-demand voice data and sending the voice data to the server; playing the multimedia program according to the multimedia program playing processing result of the server;
the server is used for constructing a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
The present application further provides an ordering system, comprising:
the ordering device is used for acquiring ordering voice data and sending the voice data to the server;
the server is used for constructing a food knowledge base; determining food information in the voice data through a voice entity recognition model and the knowledge base; and executing food preparation processing according to the food information.
The present application further provides a communication connection establishing system, including:
the user equipment is used for acquiring communication instruction voice data and sending the voice data to the server;
the server is used for constructing a communication user knowledge base; determining communication user information in the voice data through a voice entity recognition model and the knowledge base; and executing communication connection establishment processing according to the communication user information.
The present application further provides a voice interaction system, comprising:
the terminal equipment is used for acquiring voice data and sending the voice data to the server;
the server is used for constructing an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
The application also provides a voice interaction method, which comprises the following steps:
constructing an entity knowledge base;
determining entity information in the target voice data through a voice entity recognition model and the entity knowledge base;
and executing voice interaction processing according to the entity information.
Optionally, the determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base includes:
determining audio characteristic data of the voice data through an audio coding model included in the voice entity recognition model;
and determining the entity information according to the audio characteristic data through an entity decoding model included in the voice entity recognition model and the entity knowledge base.
Optionally, the determining the entity information according to the audio feature data by using the entity decoding model and the entity knowledge base included in the speech entity recognition model includes:
determining at least one candidate pronunciation of the entity information according to the audio characteristic data through an entity candidate pronunciation determination module included in the entity decoding model;
determining, by an entity pronunciation determination module included in the entity decoding model, a pronunciation of the entity information from the at least one candidate pronunciation according to the entity knowledge base;
and determining the entity information according to the pronunciation of the entity information.
Optionally, the determining the pronunciation of the entity information from the at least one candidate pronunciation according to the entity knowledge base includes:
determining similarity of the pronunciation of the entity in the entity knowledge base and the candidate pronunciation;
and determining the pronunciation of the entity information according to the similarity.
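A minimal sketch of this step, under stated assumptions: the decoder has produced several candidate pronunciations from the audio, and the one most similar to a knowledge-base entity pronunciation is selected. The patent does not specify the similarity measure; a normalized edit distance over phoneme tokens is used here as one plausible choice, and the pinyin-like token sequences and entity names are hypothetical:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance over token sequences (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def similarity(p, q):
    """Normalized similarity in [0, 1]: 1.0 means identical pronunciations."""
    return 1.0 - edit_distance(p, q) / max(len(p), len(q))

# Entity pronunciations registered in the knowledge base (hypothetical).
kb_prons = {("lei", "yu", "xin"): "LeiYuXin",
            ("li", "yu", "chun"): "LiYuChun"}

def pick_pronunciation(candidates, kb):
    """Return the (candidate, KB pronunciation) pair with maximum similarity."""
    return max(((c, k) for c in candidates for k in kb),
               key=lambda ck: similarity(*ck))

cands = [("lei", "yu", "xing"), ("lei", "yu", "xin")]
best_cand, best_kb = pick_pronunciation(cands, kb_prons)
assert best_cand == ("lei", "yu", "xin")
```

The key point is that the comparison happens in pronunciation space, so a near-miss candidate still lands on the right knowledge-base entry.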
Optionally, the entity knowledge base includes: a program entity knowledge base in the multimedia program on demand field;
the program entity knowledge base comprises: program-related entities, user entities, and entity relations between the user entities and homophonic entities (same pronunciation, different written characters);
the building of the entity knowledge base comprises the following steps:
determining the user entity according to the historical playing information of the user, and constructing the entity relationship;
the determining the entity information according to the pronunciation of the entity information includes:
determining candidate entities according to the pronunciations of the entity information;
and determining the entity information from the candidate entities according to the user information and the entity relationship.
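A hedged sketch of this disambiguation step: when a decoded pronunciation matches several homophonic entities (same sound, different written forms), a candidate related to the user through play-history relations is preferred. The entities, the relation table, and the fallback policy below are all hypothetical illustrations, not the patent's method:

```python
# Pronunciation index over homophonic entities (hypothetical characters).
pron_index = {("ji", "nian"): ["纪念", "记念", "祭念"]}
# Entity relations built from each user's historical playing information.
user_relations = {"user_42": {"记念"}}

def resolve_entity(pron, user_id):
    """Pick the homophone the user has a relation to; else fall back to
    the first candidate (e.g. the most popular one)."""
    candidates = pron_index.get(pron, [])
    played = user_relations.get(user_id, set())
    for c in candidates:
        if c in played:
            return c
    return candidates[0] if candidates else None

# A user with matching play history gets "their" homophone...
assert resolve_entity(("ji", "nian"), "user_42") == "记念"
# ...while an unknown user gets the default candidate.
assert resolve_entity(("ji", "nian"), "user_99") == "纪念"
```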
Optionally, the method further includes:
learning from training data to obtain the speech entity recognition model;
wherein the training data comprises: audio data and entity tagging information.
Optionally, the determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base includes:
determining audio characteristic data of the voice data through an audio coding model included in the voice entity recognition model;
determining pronunciation characteristic data of an entity in the entity knowledge base through an entity coding model included in the voice entity recognition model;
and determining the entity information according to the audio characteristic data and the entity pronunciation characteristic data through an entity decoding model included in the voice entity recognition model.
Optionally, the determining the entity information according to the audio feature data and the entity pronunciation feature data by the entity decoding model included in the speech entity recognition model includes:
determining pronunciation similarity between an entity in the voice data and an entity in the entity knowledge base according to the audio characteristic data and the entity pronunciation characteristic data;
and determining the entity information according to the pronunciation similarity.
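The second architecture can be sketched as follows, with all specifics hypothetical: an entity coding model embeds each knowledge-base entity's pronunciation as a feature vector, and the decoder compares the audio feature data against those vectors, returning the closest entity. Cosine similarity and the tiny fixed vectors below stand in for learned encoders and a real similarity head:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# entity encoder output: entity -> pronunciation feature vector (hypothetical).
entity_embeddings = {
    "LeiYuXin": (0.9, 0.1, 0.3),
    "LiYuChun": (0.2, 0.8, 0.5),
}

def decode_entity(audio_features):
    """Return the KB entity whose pronunciation embedding best matches
    the audio feature data produced by the audio coding model."""
    return max(entity_embeddings,
               key=lambda e: cosine(audio_features, entity_embeddings[e]))

# Audio feature vectors as the audio coding model might emit (hypothetical).
assert decode_entity((0.85, 0.15, 0.25)) == "LeiYuXin"
assert decode_entity((0.1, 0.9, 0.4)) == "LiYuChun"
```

Compared with the first architecture, the knowledge base participates at encoding time (its entities are embedded) rather than only at candidate-filtering time.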
Optionally, the entity knowledge base includes: a program entity knowledge base in the multimedia program on demand field;
the program entity knowledge base comprises: program-related entities, user entities, and entity relations between the user entities and homophonic entities (same pronunciation, different written characters);
the building of the entity knowledge base comprises the following steps:
determining the user entity according to the historical playing information of the user, and constructing the entity relationship;
the determining the entity information according to the pronunciation similarity includes:
determining candidate entities according to the pronunciation similarity;
and determining the entity information from the candidate entities according to the user information and the entity relationship.
Optionally, the method further includes:
learning from training data to obtain the speech entity recognition model;
wherein the training data comprises: audio data, an entity knowledge base, and entity tagging information.
Optionally, the entity knowledge base includes: a program entity knowledge base in the multimedia program on demand field;
the building of the entity knowledge base comprises the following steps:
and determining related entities of the multimedia program to form the entity knowledge base.
The application also provides a voice interaction method, which comprises the following steps:
collecting voice data, and sending the voice data to a server so that the server constructs an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
The application also provides a multimedia program on demand method, which comprises the following steps:
constructing a multimedia program knowledge base;
determining multimedia program information in the multimedia program on demand voice data through a voice entity recognition model and the knowledge base;
and executing multimedia program playing processing according to the multimedia program information.
The application also provides a multimedia program on demand method, which comprises the following steps:
collecting multimedia program on-demand voice data, and sending the voice data to a server side so that the server side can construct a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
The application also provides a food ordering method, which comprises the following steps:
constructing a food knowledge base;
determining food information in the food ordering voice data through a voice entity recognition model and the knowledge base;
and executing food preparation processing according to the food information.
The application also provides a food ordering method, which comprises the following steps:
collecting ordering voice data, and sending the voice data to a server side so that the server side can construct a food knowledge base; determining food information in the voice data through a voice entity recognition model and the knowledge base; and executing food preparation processing according to the food information.
The application also provides a communication connection establishing method, which comprises the following steps:
constructing a communication user knowledge base;
determining communication user information in communication instruction voice data through a voice entity recognition model and the knowledge base;
and executing communication connection establishment processing according to the communication user information.
The application also provides a communication connection establishing method, which comprises the following steps:
collecting communication instruction voice data, and sending the voice data to a server side so that the server side can construct a communication user knowledge base; determining communication user information in the voice data through a voice entity recognition model and the knowledge base; and executing communication connection establishment processing according to the communication user information.
The application also provides a method for constructing the speech entity recognition model, which comprises the following steps:
determining a training data set, the training data comprising: voice data, entity tagging information and an entity knowledge base;
constructing a network structure of the model;
the model is learned from a training data set.
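The three steps above can be made concrete with a deliberately tiny, hedged sketch. A real implementation would train neural encoder/decoder networks from audio; here the "network structure" is just a pronunciation-to-entity table and "learning" fills it from labeled pairs, only to show the data flow. Every example value is hypothetical:

```python
# Step 1: determine a training data set: (audio stand-in = pronunciation
# tokens, entity label) pairs, plus an entity knowledge base.
train_set = [((("ji", "nian"),), "记念"), ((("yan", "yuan"),), "演员")]
knowledge_base = {"记念", "演员"}

# Step 2: construct the model structure.
model = {"pron2entity": {}}

# Step 3: learn the model from the training data set.
for (pron,), entity in train_set:
    assert entity in knowledge_base  # labels must exist in the KB
    model["pron2entity"][pron] = entity

assert model["pron2entity"][("ji", "nian")] == "记念"
```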
Optionally, the model comprises an audio coding model for determining audio characteristic data of the speech data;
the model comprises an entity decoding model, and entity information in the voice data is determined according to the audio characteristic data and the entity knowledge base.
Optionally, the model comprises an audio coding model for determining audio characteristic data of the speech data;
the model comprises an entity coding model used for determining pronunciation characteristic data of the entity in the entity knowledge base;
the model comprises an entity decoding model used for determining entity information in the voice data according to the audio characteristic data and the entity pronunciation characteristic data.
The application also provides a method for constructing the entity knowledge base, which comprises the following steps:
acquiring an entity name of a target field;
and generating an entity knowledge base of the target field according to the entity name, wherein the entity knowledge base is used for determining entity information in the voice data of the target field through a voice entity recognition model and the entity knowledge base.
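One way to read this claim, sketched under stated assumptions: entity names of the target field are collected and indexed by pronunciation, so that the recognition model can later match decoded pronunciations against the base. The `PRON` table below is a hypothetical stand-in for a real grapheme-to-phoneme (e.g. pinyin) step:

```python
# Hypothetical pronunciation lexicon for the target field's entity names.
PRON = {"记念": ("ji", "nian"), "纪念日": ("ji", "nian", "ri")}

def build_knowledge_base(entity_names):
    """Index entity names of the target field by their pronunciation."""
    kb = {}
    for name in entity_names:
        kb.setdefault(PRON[name], []).append(name)
    return kb

kb = build_knowledge_base(["记念", "纪念日"])
assert kb[("ji", "nian")] == ["记念"]
assert kb[("ji", "nian", "ri")] == ["纪念日"]
```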
The application also provides a voice entity recognition method, which comprises the following steps:
constructing an entity knowledge base and a voice entity recognition model;
determining target voice data;
and determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base.
The present application further provides a voice interaction apparatus, including:
the knowledge base construction unit is used for constructing an entity knowledge base;
the entity determining unit is used for determining entity information in the target voice data through a voice entity recognition model and the entity knowledge base;
and the interactive processing unit is used for executing voice interactive processing according to the entity information.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, the device being powered on and the program for implementing the method being executed by the processor for performing the steps of: constructing an entity knowledge base; determining entity information in the target voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
The present application further provides a voice interaction apparatus, including:
the voice data acquisition unit is used for acquiring voice data;
the voice data sending unit is used for sending the voice data to the server so that the server constructs an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, the device being powered on and the program for implementing the method being executed by the processor for performing the steps of: collecting voice data, and sending the voice data to a server so that the server constructs an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
The present application further provides a multimedia program on demand device, comprising:
the voice data acquisition unit is used for acquiring multimedia program on-demand voice data;
the voice data sending unit is used for sending the voice data to the server so that the server can construct a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
The application further provides an intelligent sound box, include:
a processor; and
a memory for storing a program for implementing the method for requesting a multimedia program, the device being powered on and running the program for the method via the processor for performing the following steps: collecting multimedia program on-demand voice data, and sending the voice data to a server side so that the server side can construct a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
The present application further provides a multimedia program on demand device, comprising:
the knowledge base construction unit is used for constructing a multimedia program knowledge base;
the entity recognition unit is used for determining the multimedia program information in the multimedia program on-demand voice data through the voice entity recognition model and the knowledge base;
and the program playing processing unit is used for executing the multimedia program playing processing according to the multimedia program information.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the method for requesting a multimedia program, the device being powered on and running the program for the method via the processor for performing the following steps: constructing a multimedia program knowledge base; determining multimedia program information in the multimedia program on demand voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
The present application further provides an ordering device, comprising:
the knowledge base construction unit is used for constructing a food knowledge base;
the entity recognition unit is used for determining food information in the food ordering voice data through a voice entity recognition model and the entity knowledge base;
and the meal preparation processing unit is used for executing meal preparation processing according to the meal information.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the ordering method, wherein the device executes the following steps after being powered on and running the program of the method through the processor: constructing a food knowledge base; determining food information in the food ordering voice data through a voice entity recognition model and the entity knowledge base; and executing food preparation processing according to the food information.
The present application further provides an ordering device, comprising:
the voice data acquisition unit is used for acquiring ordering voice data;
the voice data sending unit is used for sending the voice data to the server so that the server can construct a food knowledge base; determining food information in the voice data through a voice entity recognition model and the knowledge base; and executing food preparation processing according to the food information.
The application also provides an ordering machine, including:
a processor; and
a memory for storing a program for implementing the ordering method, wherein the device executes the following steps after being powered on and running the program of the method through the processor: collecting ordering voice data, and sending the voice data to a server side so that the server side can construct a food knowledge base; determining food information in the voice data through a voice entity recognition model and the knowledge base; and executing food preparation processing according to the food information.
The present application further provides a communication connection establishing apparatus, including:
the voice data acquisition unit is used for acquiring communication instruction voice data;
the voice data sending unit is used for sending the voice data to the server so that the server can construct a communication user knowledge base; determining communication user information in the voice data through a voice entity recognition model and the knowledge base; and executing communication connection establishment processing according to the communication user information.
The present application further provides a user equipment, comprising:
a processor; and
a memory for storing a program for implementing the communication connection establishment method, wherein the following steps are executed after the device is powered on and the program for implementing the method is run by the processor: collecting communication instruction voice data, and sending the voice data to a server side so that the server side can construct a communication user knowledge base; determining communication user information in the voice data through a voice entity recognition model and the knowledge base; and executing communication connection establishment processing according to the communication user information.
The present application further provides a communication connection establishing apparatus, including:
the knowledge base construction unit is used for constructing a communication user knowledge base;
the entity recognition unit is used for determining communication user information in the communication instruction voice data through a voice entity recognition model and the knowledge base;
and the communication connection processing unit is used for executing communication connection establishment processing according to the communication user information.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the communication connection establishing method, wherein the device executes the following steps after being powered on and running the program of the method through the processor: constructing a communication user knowledge base; determining communication user information in communication instruction voice data through a voice entity recognition model and the knowledge base; and executing communication connection establishment processing according to the communication user information.
The present application further provides a speech entity recognition model building apparatus, including:
a training data determination unit for determining a training data set, the training data comprising: voice data, entity tagging information and an entity knowledge base;
the network construction unit is used for constructing a network structure of the model;
and the model training unit is used for learning the model from the training data set.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing a speech entity recognition model building method, the apparatus performing the following steps after being powered on and running the program of the method through the processor: determining a training data set, the training data comprising: voice data, entity tagging information and an entity knowledge base; constructing a network structure of the model; the model is learned from a training data set.
The present application further provides an entity knowledge base constructing apparatus, including:
the entity determining unit is used for acquiring an entity name of the target field;
and the knowledge base generation unit is used for generating an entity knowledge base of the target field according to the entity name, wherein the entity knowledge base is used for determining entity information in the voice data of the target field through the voice entity recognition model and the entity knowledge base.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the entity knowledge base construction method, the device executing the following steps after being powered on and running the program of the method through the processor: acquiring an entity name of a target field; and generating an entity knowledge base of the target field according to the entity name, wherein the entity knowledge base is used for determining entity information in the voice data of the target field through a voice entity recognition model and the entity knowledge base.
The present application further provides a speech entity recognition apparatus, including:
the model construction unit is used for constructing a voice entity recognition model;
the knowledge base construction unit is used for constructing an entity knowledge base;
a voice data determination unit for determining target voice data;
and the entity determining unit is used for determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing a speech entity recognition method, the device performing the following steps after being powered on and running the program of the method by the processor: constructing an entity knowledge base and a voice entity recognition model; determining target voice data; and determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base.
The application also provides a television program playing method, which comprises the following steps:
building a television program knowledge base;
determining a target program name corresponding to the target program playing voice instruction data through a voice entity recognition model and the knowledge base;
and executing target program object playing processing according to the target program name.
Optionally, the knowledge base includes: program related entities that are homophones written with different characters, user entities, and entity relationships between the program related entities and the user entities;
the building of the television program knowledge base comprises the following steps:
and determining the user entity according to the historical playing information of the user, and constructing the entity relationship.
Optionally, the executing, according to the target program name, target program object playing processing includes:
determining a television channel and playing time corresponding to the target program name according to a program list;
determining a target program object according to the playing time and the television channel;
and executing the processing of playing the target program object.
Optionally, the determining a target program object according to the playing time and the television channel includes:
displaying a plurality of program objects played by at least one television channel corresponding to the target program name at a plurality of times;
and taking the program object specified by the user as a target program object.
Optionally, the method further includes:
if the program list does not comprise the target program name, determining the program name related to the target program name;
displaying the related program name;
and if the user designates to play the related program object, executing the processing of playing the related program object.
The application also provides a television program playing method, which comprises the following steps:
the intelligent television collects program playing voice instruction data of a user;
sending the voice instruction data to a server so that the server can conveniently construct a television program knowledge base; determining a target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base; executing target program object playing processing according to the target program name;
and playing the target program object.
The application also provides a conference recording method, which comprises the following steps:
constructing a language knowledge base of the conference field;
determining entity information in the voice data of the target conference through a voice entity recognition model and a language knowledge base of the conference field;
and determining a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
Optionally, the method further includes:
and determining a conference field corresponding to the conference voice data.
The application also provides a conference recording method, which comprises the following steps:
collecting voice data of a target conference;
sending the voice data to a server side so that the server side can conveniently construct a language knowledge base of the conference field; determining entity information in the voice data of the target conference through a voice entity recognition model and a language knowledge base of the conference field; and determining a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
The present application further provides a television program playing device, including:
the voice data acquisition unit is used for acquiring program playing voice instruction data of a user;
the voice data sending unit is used for sending the voice instruction data to the server so as to facilitate the server to construct a television program knowledge base; determining a target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base; executing target program object playing processing according to the target program name;
and the program playing unit is used for playing the target program object.
The application also provides a smart television, including:
a processor; and
a memory for storing a program for implementing a television program broadcasting method, wherein the following steps are executed after the device is powered on and the program of the method is run by the processor: collecting program playing voice instruction data of a user; sending the voice instruction data to a server so that the server can conveniently construct a television program knowledge base; determining a target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base; executing target program object playing processing according to the target program name; and playing the target program object.
The present application further provides a television program playing device, including:
the knowledge base construction unit is used for constructing a television program knowledge base;
the entity recognition unit is used for determining a target program name corresponding to the target program playing voice instruction data through a voice entity recognition model and the knowledge base;
and the playing processing unit is used for executing the playing processing of the target program object according to the target program name.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing a television program broadcasting method, wherein the following steps are executed after the device is powered on and the program of the method is run by the processor: building a television program knowledge base; determining a target program name corresponding to the target program playing voice instruction data through a voice entity recognition model and the knowledge base; and executing target program object playing processing according to the target program name.
The present application further provides a conference recording apparatus, including:
the voice data acquisition unit is used for acquiring voice data of the target conference;
the voice data sending unit is used for sending the voice data to the server so that the server can conveniently construct a language knowledge base of the conference field; determining entity information in the voice data of the target conference through a voice entity recognition model and a language knowledge base of the conference field; and determining a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing a conference recording method, the apparatus performing the following steps after being powered on and running the program of the method by the processor: collecting voice data of a target conference; sending the voice data to a server side so that the server side can conveniently construct a language knowledge base of the conference field; determining entity information in the voice data of the target conference through a voice entity recognition model and a language knowledge base of the conference field; and determining a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
The present application further provides a conference recording apparatus, including:
the knowledge base construction unit is used for constructing a language knowledge base of the conference field;
the entity recognition unit is used for determining entity information in the target conference voice data through a voice entity recognition model and a language knowledge base of the conference field;
and the conference record determining unit is used for determining a text sequence corresponding to the conference voice data through the voice recognition model and the entity information to form a conference record.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing a conference recording method, the apparatus performing the following steps after being powered on and running the program of the method by the processor: constructing a language knowledge base of the conference field; determining entity information in the voice data of the target conference through a voice entity recognition model and a language knowledge base of the conference field; and determining a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the method has the following advantages:
according to the multimedia program on demand system provided by the embodiment of the application, the voice data are sent to the server side through the multimedia program on demand voice data of the intelligent sound box; playing the multimedia program according to the multimedia program playing processing result of the server; the server is used for constructing a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; executing multimedia program playing processing according to the multimedia program information; the processing mode leads to the introduction of the knowledge map information of the multimedia program, directly compares whether the multimedia program entity in the knowledge map pronounces in the multimedia program on-demand voice, realizes semantic understanding and entity recognition from the voice signal, does not depend on ASR to convert the voice into characters, and recognizes the multimedia program entity name in the words through the semantic understanding of the characters, thus being closer to the process of understanding the voice by human beings; therefore, the accuracy of identifying the multimedia program name can be effectively improved, and the success rate and the accuracy of ordering the multimedia program are improved.
In the food ordering system provided by the embodiments of the application, the ordering device collects ordering voice data and sends the voice data to the server; the server constructs a food knowledge base, determines food information in the voice data through a voice entity recognition model and the knowledge base, and executes food preparation processing according to the food information. This processing mode introduces food knowledge graph information and directly compares whether the pronunciations of the food names in the knowledge graph appear in the ordering voice, realizing semantic understanding and food name recognition from the ordering voice signal itself, rather than relying on ASR to convert the voice into text and then recognizing food names through semantic understanding of the text. This is closer to the way humans understand speech; therefore, the accuracy of food name recognition can be effectively improved, improving the ordering success rate and accuracy.
In the communication connection establishing system provided by the embodiments of the application, the user equipment collects communication instruction voice data and sends the voice data to the server; the server constructs a communication user knowledge base, determines communication user information in the voice data through a voice entity recognition model and the knowledge base, and executes communication connection establishment processing according to the communication user information. This processing mode introduces contact knowledge graph information and directly compares whether the pronunciations of the names in the knowledge graph appear in the call instruction voice, realizing semantic understanding and name recognition from the call instruction voice signal itself, rather than relying on ASR to convert the voice into text and then recognizing names through semantic understanding of the text. This is closer to the way humans understand speech; therefore, the accuracy of name recognition can be effectively improved, improving the success rate and accuracy of communication.
In the voice interaction system provided by the embodiments of the application, the terminal device collects voice data and sends the voice data to the server; the server constructs an entity knowledge base, determines entity information in the voice data through a voice entity recognition model and the entity knowledge base, and performs voice interaction processing according to the entity information. This processing mode introduces entity knowledge graph information and directly compares whether the pronunciations of the entities in the knowledge graph appear in the voice, realizing semantic understanding and entity recognition from the voice signal itself, rather than relying on ASR to convert the voice into text and then recognizing entity names through semantic understanding of the text. This is closer to the way humans understand speech; therefore, the accuracy of voice entity recognition can be effectively improved, improving voice interaction accuracy.
In the television program playing system provided by the embodiments of the application, the smart television collects a user's program playing voice instruction data, sends the voice instruction data to the server, and plays the target program object; the server constructs a television program knowledge base, determines the target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base, and executes target program object playing processing according to the target program name. This processing mode introduces program related entity graph information and directly compares whether the pronunciations of the program related entities in the knowledge graph appear in the program playing voice instruction, realizing semantic understanding and recognition of entities such as program names from the instruction signal itself, rather than relying on ASR to convert the voice into text and then recognizing program names through semantic understanding of the text. This is closer to the way humans understand speech; therefore, the accuracy of program entity recognition can be effectively improved, improving the success rate and accuracy of video program on demand.
In the conference recording system provided by the embodiments of the application, the terminal device collects voice data of the target conference and sends the voice data to the server; the server constructs a language knowledge base of the conference field, determines entity information in the voice data of the target conference through a voice entity recognition model and the language knowledge base, and determines the text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record. This processing mode introduces entity graph information such as terms related to the conference field and directly compares whether the pronunciations of those entities in the knowledge graph appear in the conference voice data, realizing semantic understanding and recognition of entities such as field-related terms from the conference voice signal itself, rather than relying on ASR to convert the voice into text and then recognizing the terms through semantic understanding of the text. This is closer to the way humans understand speech; therefore, the accuracy of recognizing conference-field terms can be effectively improved, improving the success rate and accuracy of conference recording.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a multimedia program on demand system provided in the present application;
FIG. 2 is a schematic view of a multimedia program on demand system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an apparatus interaction of an embodiment of a multimedia program on demand system provided by the present application;
FIG. 4 is a system architecture diagram of an embodiment of a multimedia program on demand system provided by the present application;
FIG. 5 is a schematic diagram of a speech entity recognition model of an embodiment of a multimedia program on demand system provided by the present application;
FIG. 6 is a schematic view of a knowledge graph of an embodiment of a multimedia program on demand system provided by the present application;
FIG. 7 is a schematic diagram of another speech entity recognition model of an embodiment of a multimedia program on demand system provided by the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The application provides a multimedia program on demand system, method, and apparatus; a food ordering system, method, and apparatus; a communication connection establishing system, method, and apparatus; a voice interaction system and method; a speech entity recognition model construction method and apparatus; an entity knowledge base construction method and apparatus; a speech entity recognition method and apparatus; a television program playing method and apparatus; a conference recording method and apparatus; a smart speaker; a smart television; a food ordering machine; user equipment; and an electronic device. Each scheme is described in detail in the following embodiments.
First embodiment
Please refer to fig. 1, which is a schematic diagram of an embodiment of a multimedia program on demand system according to the present application. The multimedia program on demand system provided by this embodiment comprises: a server 1 and a smart speaker 2.
The server 1 may be a server deployed on a cloud server, or may be a server dedicated to implementing a multimedia program on demand system, and may be deployed in a data center.
The smart speaker 2 may be a tool for home consumers to access the internet by voice, for example to order songs, shop online, or check the weather forecast; it may also control smart home devices, such as opening curtains, setting the refrigerator temperature, or preheating the water heater in advance.
Please refer to fig. 2, which is a schematic view of a multimedia program on demand system according to the present application. The server 1 and the smart speaker 2 can be connected via a network; for example, the smart speaker 2 can be networked via WIFI. The user interacts with the smart speaker by voice. In this embodiment, a user issues a song-ordering voice instruction to the smart speaker 2; the server determines the song name information in the song-ordering voice instruction through a voice entity recognition model and a pre-established multimedia program knowledge base, and then performs the process of playing the song.
Please refer to fig. 3, which is a schematic diagram of an apparatus of the multimedia program on demand system of the present application. In this embodiment, the smart sound box is configured to collect multimedia program on-demand voice data and send the voice data to the server; the server is used for constructing a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
The multimedia program may be a song, a movie, a television program, a lecture video, etc. The multimedia program knowledge base comprises program related entity information. The program-related entity information includes, but is not limited to, at least one of the following entities: program names (e.g., song names, movie names, television program names, speaker names, etc.), program related person names (e.g., artist names, movie director names, etc.), and so forth.
The server needs to construct a multimedia program knowledge base, which can be done as follows: determine the entities related to multimedia programs to form the multimedia program knowledge base. Table 1 shows the content of the multimedia program knowledge base in this embodiment.
TABLE 1 Entity data (table content reproduced as an image in the original publication)
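The entity knowledge base described here can be pictured as a pronunciation index over program related entities. Below is a minimal, hypothetical sketch; the data-structure choice and the pinyin field are assumptions for illustration, and the entity names are taken from the scenario examples later in this description, not from the original Table 1:

```python
def build_knowledge_base(entities):
    """Index program-related entities by their pinyin pronunciation."""
    kb = {}
    for name, entity_type, pinyin in entities:
        kb.setdefault(pinyin, []).append({"name": name, "type": entity_type})
    return kb

# Hypothetical entries mirroring the homophone and movie examples below.
kb = build_knowledge_base([
    ("Memorial", "song", "ji nian"),
    ("Commemoration", "song", "ji nian"),
    ("Nezha", "movie", "ne zha"),
])
```

Indexing by pronunciation rather than by written name is what lets recognition compare candidate pronunciations against the knowledge base directly, before any text is produced.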
As can be seen from fig. 4, on the smart speaker side, the analog voice signal is converted into a digital voice signal after passing through the speaker's receiving transducer and the corresponding digital signal processing unit; this stage may also include part of the acoustic processing, producing intermediate speech results, which may include but are not limited to spectrogram signals, phonemes, words, word fragments, pinyin, and the like. The smart speaker can upload the digital voice signal to the server, where the voice entity recognition model models the digital voice signal and, combined with the entity information in the knowledge graph (i.e., the knowledge base), recognizes the entity names (such as song names) mentioned in the voice.
In specific implementation, the speech entity recognition model may be a machine learning model (such as a deep neural network model, a bayesian model, etc.), or may be a non-machine learning model (such as a heuristic model, etc.).
The speech entity recognition model can perform semantic understanding and entity recognition directly from the speech signal, without relying on ASR to first convert the speech into text and then recognizing entity names through semantic understanding of the text. By introducing the knowledge graph, the model can directly compare whether the pronunciations of entities in the knowledge graph appear in the speech; this is closer to the way humans understand speech, yields more accurate entity recognition results, and effectively handles the problems of inaccurate pronunciation, unclear articulation, and homophones (one pronunciation corresponding to multiple written forms).
The following describes specific problems that can be solved by the system provided by the embodiment of the present application through several application scenario examples:
scene 1:
User Xiao Zhao often orders songs through the smart speaker while driving. Because of a regional accent, Xiao Zhao does not distinguish front/back nasal finals or flat/retroflex initials, so in a conventional voice entity recognition system the smart speaker executes wrong instructions. For example, the pinyin corresponding to Xiao Zhao's voice instruction comes out as "wo xiang ting bie zi ji"; the smart speaker recognizes the text "I want to listen to Bie Ziji" and finally takes "Bie Ziji" as a song name, but no song named "Bie Ziji" exists; the song Xiao Zhao really wants to hear is "Bie Zhiji". With the system provided by the embodiment of the application, the knowledge base contains the song name "Bie Zhiji", so the song can be accurately identified and played correctly.
Scene 2:
User Doudou is a five-year-old child who often watches cartoons on the family's screen-equipped smart speaker. Doudou is too young to type and can only search by voice, but Doudou's articulation is unclear, so requests often fail and require a parent's help. For example, after Doudou issues a voice on-demand instruction, the smart speaker's ASR recognizes the text "I want to watch the station", while what Doudou actually wants to express is "I want to watch Nezha". With the system provided by the embodiment of the application, the knowledge base contains the movie name "Nezha", so the movie can be accurately identified and played correctly.
Scene 3:
User Wang wants to listen to music to relieve fatigue while doing housework. With both hands occupied, Wang orders a song from the home smart speaker by voice. The pinyin corresponding to the voice instruction is "wo yao ting lei yu xin de ji nian"; the smart speaker's ASR recognizes the text as "I want to listen to Lei Yuxin's Commemoration", but what Wang actually wants to play is the song "Memorial", not "Commemoration" sung by Lei Yuxin. With the system provided by the embodiment of the application, the knowledge base contains the song name "Memorial", so the song can be accurately identified and played correctly.
Please refer to fig. 5, which is a schematic diagram of a speech entity recognition model of the multimedia program on demand system according to the present application. In one example, to determine entity information in the target voice data through the voice entity recognition model and the knowledge base, the following processing procedures may be adopted: firstly, determining audio characteristic data of the voice data through an audio coding model included in the voice entity recognition model; then, the multimedia program information is determined according to the audio characteristic data through an entity decoding model included in the voice entity recognition model and the knowledge base.
Wherein the input data of the audio coding model may comprise a sequence of audio frames, each audio frame may comprise acoustic feature data of an audio signal or the like. The output data of the entity decoding model may include pinyin, but in practical applications, may be any representation of an entity, such as text, phoneme, etc.
In a specific implementation, the original audio signal may be processed by an audio signal processing module (mel filters, etc.) to form the audio frame sequence. The audio coding model can adopt a common coding model; its structure includes but is not limited to LSTM, Transformer, and the like. The network structure of the entity decoding model likewise includes but is not limited to LSTM, Transformer, and the like.
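The framing step performed by such an audio signal processing module can be sketched as follows. The 25 ms frame / 10 ms hop at 16 kHz are common defaults, not values given in the application:

```python
def frame_signal(signal, frame_len=400, hop_len=160):
    """Split a 1-D sample sequence into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append(signal[start:start + frame_len])
    return frames

samples = [0.0] * 1600  # 0.1 s of silence at 16 kHz, as stand-in audio
frames = frame_signal(samples)
```

Acoustic features such as mel filter-bank energies would then be computed per frame to form the audio frame sequence fed to the encoder.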
In one example, an entity candidate pronunciation determination module included in the entity decoding model determines at least one candidate pronunciation of the multimedia program information according to the audio feature data; an entity pronunciation determination module included in the entity decoding model determines the pronunciation of the multimedia program information from the at least one candidate pronunciation according to the knowledge base; and the multimedia program information is determined according to that pronunciation. The input to the model is an audio signal (a sequence of audio frames) and the output is the name of an entity contained in the audio (e.g., a pinyin sequence). In this model, confusable sounds produce candidates for all possible pronunciations within the decoding model, and the more likely pronunciation is then determined using the entity knowledge base. In particular, the decoding part can automatically attend to suitable positions of the encoding network's output to generate the candidate pronunciations.
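The expansion of confusable sounds into candidate pronunciations can be illustrated with hand-written confusion sets. In the model the candidates come from the decoder itself, so the sets below (the flat/retroflex and front/back nasal pairs from the accent scenarios) are purely illustrative assumptions:

```python
from itertools import product

# Illustrative confusion sets for accent-prone syllables (assumed, not from the patent).
CONFUSION_SETS = {
    "zi": ["zi", "zhi"],   # flat vs. retroflex initial
    "si": ["si", "shi"],
    "in": ["in", "ing"],   # front vs. back nasal final
}

def candidate_pronunciations(syllables):
    """Expand each syllable into its confusion set and enumerate all combinations."""
    options = [CONFUSION_SETS.get(s, [s]) for s in syllables]
    return [" ".join(combo) for combo in product(*options)]

cands = candidate_pronunciations(["bie", "zi", "ji"])  # as in the accent scenario
```

Each candidate pronunciation is then scored against the knowledge base to select the most likely one.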
In a specific implementation, determining the pronunciation of the multimedia program information from the at least one candidate pronunciation according to the knowledge base may include the following sub-steps: 1) determining the similarity between the pronunciations of entities in the knowledge base and the candidate pronunciations; 2) determining the pronunciation of the multimedia program information according to the similarity, for example taking the highest-ranked candidate pronunciation whose similarity exceeds a similarity threshold as the pronunciation of the multimedia program information. The similarity calculation can be learned automatically by a neural network.
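Sub-steps 1) and 2) can be sketched as below. The application leaves the similarity to automatic neural-network learning; a simple syllable-overlap ratio stands in for that learned score here, so both the scoring function and the threshold value are assumptions:

```python
def syllable_similarity(a, b):
    """Fraction of aligned syllables that match; a stand-in for a learned score."""
    sa, sb = a.split(), b.split()
    if len(sa) != len(sb):
        return 0.0
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

def best_entity_pronunciation(candidates, kb_pronunciations, threshold=0.6):
    """Pick the knowledge-base pronunciation best matching any candidate."""
    best, best_score = None, threshold
    for cand in candidates:
        for kb_pron in kb_pronunciations:
            score = syllable_similarity(cand, kb_pron)
            if score > best_score:
                best, best_score = kb_pron, score
    return best

match = best_entity_pronunciation(["bie zi ji", "bie zhi ji"],
                                  ["bie zhi ji", "ji nian"])
```

Here the accent-distorted candidate still overlaps the knowledge-base entry on two of three syllables, so the correct entity pronunciation wins even before the exact candidate is considered.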
In specific implementation, the server can be used for learning the speech entity recognition model from training data; wherein the training data comprises: audio data and multimedia program annotation information.
In specific implementation, the server can also be used for determining an entity knowledge base and a training data set and constructing a network structure of the model; and taking the voice data as input data of the model, taking the multimedia program marking information as output data of the model, and training network parameters of the model according to the knowledge base.
In one example, if the knowledge base includes program related entities that are homophones written with different characters, for example the song names include the two songs "Commemoration" and "Memorial", the knowledge base may further include: user entities, and entity relationships between program related entities and user entities. Table 2 shows the entity relationships between program related entities and user entities in this embodiment.
TABLE 2 entity relationship data
As can be seen from table 2, the entity relationship may be a correspondence between a user entity and a program entity, or a correspondence between a user entity and the name of a program related to that user entity. Table 3 shows the user entity information in this embodiment.
User entity identification | User account (Taobao account)
1 | Abdgdf001
2 | Hanhao55
TABLE 3 user entity data
In one example, the entity relationship between the program related entity and the user entity may be determined as follows: and determining the user entity according to the historical playing information of the user, and constructing the entity relationship. Table 4 shows the user history play information in the present embodiment.
Play record identification | User entity | Program related entity | Time
1 | Zhang San | Memorial | 20200526
2 | Zhang San | Go back to the head | 20200315
23 | Li Si | Zhou Jie Lun | 20190230
24 | Li Si | Semi-kettle yarn | 20200511
TABLE 4 user history play data
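The relationship construction described above can be sketched directly from records shaped like those in Table 4. The field names and the helper function are illustrative assumptions:

```python
from collections import defaultdict

# User history play records, modeled on Table 4 (sample data only).
play_records = [
    {"id": 1, "user": "Zhang San", "entity": "Memorial", "time": "20200526"},
    {"id": 2, "user": "Zhang San", "entity": "Go back to the head", "time": "20200315"},
    {"id": 23, "user": "Li Si", "entity": "Zhou Jie Lun", "time": "20190230"},
    {"id": 24, "user": "Li Si", "entity": "Semi-kettle yarn", "time": "20200511"},
]

def build_entity_relations(records):
    """Derive user-entity -> program-related-entity relationships
    (the edges of the knowledge graph) from play history."""
    relations = defaultdict(set)
    for record in records:
        relations[record["user"]].add(record["entity"])
    return relations

relations = build_entity_relations(play_records)
print(sorted(relations["Zhang San"]))  # ['Go back to the head', 'Memorial']
```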
Please refer to fig. 6, which is a schematic diagram of a knowledge graph of the multimedia program on demand system of the present application. In one example, the knowledge base further includes user-type entities, and a user entity may have an entity relationship with some program-related entities but not with others. For example, if user A requests the song "memorial", there is an association with "memorial" and no association with "commemoration". As can be seen from fig. 6, the knowledge graph describes entities (nodes) and the relationships (edges) between them; the user who issued the audio command is an entity in the knowledge graph, and so is the song, and the intended song among several with the same pronunciation is determined using the user's historical listening records. In addition, the co-occurrence of songs in play lists can be used to distinguish songs with the same pronunciation.
Correspondingly, the server determines the multimedia program information according to the pronunciation of the multimedia program information, which may include the following sub-steps: 1) determining candidate entities according to the pronunciation of the multimedia program information, such as the two candidate entities "memorial" and "commemoration" corresponding to "ji nian"; 2) determining the multimedia program information from the candidate entities according to the user information and the entity relationship. For example, it is ultimately determined that user A requests the song "memorial", rather than "commemoration".
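The two disambiguation sub-steps can be sketched as follows; the candidate names, user identifier, and relation structure are illustrative assumptions:

```python
def disambiguate(candidates, user, relations):
    """Among homophonic candidate entities, prefer one the requesting
    user has an entity relationship with; otherwise fall back to the
    first candidate."""
    related = [c for c in candidates if c in relations.get(user, set())]
    return related[0] if related else candidates[0]

# Hypothetical user -> related-entities mapping from the knowledge graph.
relations = {"user A": {"memorial"}}

# "ji nian" decodes to two homophonic song titles; the user's history
# breaks the tie.
print(disambiguate(["commemoration", "memorial"], "user A", relations))  # "memorial"
```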
Please refer to fig. 7, which is a schematic diagram of another speech entity recognition model of the multimedia program on demand system of the present application. In one example, the server is specifically configured to determine the pronunciation feature data of the entities in the knowledge base, and to determine the multimedia program information according to the audio feature data and the entity pronunciation feature data through the entity decoding model included in the speech entity recognition model.
As can be seen from fig. 7, an audio feature vector can be calculated for an input audio signal (a sequence of audio frames) by the audio coding model included in the speech entity recognition model; a feature vector is calculated for an entity name (e.g., a pinyin sequence) by the entity coding model included in the speech entity recognition model; and the speech entity recognition model brings the audio feature vector closer to the feature vector of the entity name the audio contains. In the model shown in fig. 7, the feature vectors are robust to confusable sounds: even if some sounds are slightly mispronounced, the feature vectors calculated from the audio remain relatively close.
In a specific implementation, the server determines the multimedia program information according to the audio feature data and the entity pronunciation feature data through an entity decoding model included in the speech entity recognition model, and may include the following sub-steps: 1) determining pronunciation similarity between the entity in the voice data and the entity in the knowledge base according to the audio characteristic data and the entity pronunciation characteristic data; 2) and determining the multimedia program information according to the pronunciation similarity, for example, arranging a knowledge base entity with the similarity greater than a similarity threshold at a high level as the multimedia program information.
The audio coding model and the entity coding model may adopt common sequence encoding architectures, including but not limited to LSTM, Transformer, and the like. The entity decoding model may include a distance function, by which the pronunciation similarity between an entity in the speech data and an entity in the knowledge base can be determined; the distance function may be a dot product, a Euclidean distance, a cosine distance, or the like.
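A minimal sketch of the matching step, using cosine similarity (one of the distance functions mentioned above) and assuming precomputed entity pronunciation vectors; all vectors, names, and the threshold here are toy assumptions rather than outputs of a trained model:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical precomputed pronunciation feature vectors for knowledge
# base entities (in the real system these come from the entity coding
# model and can be computed offline and stored).
entity_vectors = {
    "memorial": np.array([0.9, 0.1, 0.0]),
    "half-pot gauze": np.array([0.0, 0.2, 0.95]),
}

def match_entity(audio_vector, entity_vectors, threshold=0.7):
    """Rank knowledge base entities by pronunciation similarity to the
    audio feature vector; return those above the threshold, best first."""
    scored = sorted(
        ((cosine_similarity(audio_vector, v), name) for name, v in entity_vectors.items()),
        reverse=True,
    )
    return [name for score, name in scored if score >= threshold]

audio_vector = np.array([0.85, 0.15, 0.05])  # would come from the audio coding model
print(match_entity(audio_vector, entity_vectors))  # ['memorial']
```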
In specific implementation, the server can be used for learning from training data to obtain the voice entity recognition model and the entity pronunciation feature data; wherein the training data comprises: audio data, entity tagging information, and a knowledge base. With the model shown in fig. 7, the pronunciation feature vectors of the entity can be calculated off-line and stored, and then matched with the audio feature vectors on-line.
In one example, the knowledge base includes: program related entities that are homophones written with different characters, user entities, and entity relationships between program related entities and user entities. The server needs to construct the entity knowledge base, including: determining the user entities according to the users' historical playing information and constructing the entity relationships. The server is specifically configured to determine candidate entities according to the pronunciation similarity, and to determine the entity information from the candidate entities according to the user information and the entity relationships.
As can be seen from the foregoing embodiments, in the multimedia program on demand system provided in the embodiments of the present application, the intelligent sound box collects multimedia program on-demand voice data and sends it to the server, and plays the multimedia program according to the server's playing processing result; the server constructs a multimedia program knowledge base, determines the multimedia program information in the voice data through a speech entity recognition model and the knowledge base, and executes the multimedia program playing processing according to the multimedia program information. This processing introduces multimedia program knowledge graph information and directly checks whether the pronunciation of a multimedia program entity in the knowledge graph occurs in the on-demand speech, realizing semantic understanding and entity recognition directly from the speech signal. It does not depend on ASR to first convert the speech into text and then recognize the program entity name through semantic understanding of that text, and is thus closer to the way humans understand speech. Therefore, the accuracy of identifying multimedia program names can be effectively improved, improving the success rate and accuracy of multimedia program on demand.
Second embodiment
In the foregoing embodiment, a multimedia program on demand system is provided, and accordingly, the present application also provides a voice interaction method, where an execution subject of the method may be a server, or may also be an intelligent sound box, an intelligent television, a vending machine, a ticket vending machine, a chat robot, and so on. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The voice interaction method provided by the embodiment of the application can comprise the following steps:
step 1: constructing an entity knowledge base;
step 2: determining entity information in the target voice data through a voice entity recognition model and the entity knowledge base;
and step 3: and executing voice interaction processing according to the entity information.
In one example, step 2 may include the following sub-steps:
step 2-1: determining audio characteristic data of the voice data through an audio coding model included in the voice entity recognition model;
step 2-2: and determining the entity information according to the audio characteristic data through an entity decoding model included in the voice entity recognition model and the entity knowledge base.
In one example, step 2-2 may include the following sub-steps:
step 2-2-1: determining at least one candidate pronunciation of the entity information according to the audio characteristic data through an entity candidate pronunciation determination module included in the entity decoding model;
step 2-2-2: determining, by an entity pronunciation determination module included in the entity decoding model, a pronunciation of the entity information from the at least one candidate pronunciation according to the entity knowledge base;
step 2-2-3: and determining the entity information according to the pronunciation of the entity information.
In one example, step 2-2-2 may include the following sub-steps:
step 2-2-2-1: determining similarity of the pronunciation of the entity in the entity knowledge base and the candidate pronunciation;
step 2-2-2-2: and determining the pronunciation of the entity information according to the similarity.
In one example, the entity knowledge base includes: a program entity knowledge base in the multimedia program on demand field; the program entity knowledge base includes: program related entities that are homophones written with different characters, user entities, and entity relationships between program related entities and user entities. Step 1 may include the following sub-steps: determining the user entities according to the users' historical playing information and constructing the entity relationships; accordingly, step 2-2-3 may include the following sub-steps:
step 2-2-3-1: determining candidate entities according to the pronunciations of the entity information;
step 2-2-3-2: and determining the entity information from the candidate entities according to the user information and the entity relationship.
In one example, the method may further comprise the steps of: learning from training data to obtain the speech entity recognition model; wherein the training data comprises: audio data and entity tagging information.
In another example, step 2-2 may include the following sub-steps:
step 2-2-1': determining audio characteristic data of the voice data through an audio coding model included in the voice entity recognition model;
step 2-2-2': determining pronunciation characteristic data of an entity in the entity knowledge base through an entity coding model included in the voice entity recognition model;
step 2-2-3': and determining the entity information according to the audio characteristic data and the entity pronunciation characteristic data through an entity decoding model included in the voice entity recognition model.
In one example, step 2-2-3' may include the following sub-steps:
step 2-2-3' -1: determining pronunciation similarity between an entity in the voice data and an entity in the entity knowledge base according to the audio characteristic data and the entity pronunciation characteristic data;
step 2-2-3' -2: and determining the entity information according to the pronunciation similarity.
In one example, the entity knowledge base includes: a program entity knowledge base in the multimedia program on demand field; the program entity knowledge base includes: program related entities that are homophones written with different characters, user entities, and entity relationships between program related entities and user entities. Step 1 may include the following sub-steps: determining the user entities according to the users' historical playing information and constructing the entity relationships; accordingly, step 2-2-3'-2 may include the following sub-steps:
step 2-2-3' -2-1: determining candidate entities according to the pronunciation similarity;
step 2-2-3' -2-2: and determining the entity information from the candidate entities according to the user information and the entity relationship.
In one example, the method may further comprise the steps of: learning from training data to obtain the speech entity recognition model; wherein the training data comprises: audio data, an entity knowledge base, and entity tagging information.
In one example, the entity repository includes: a program entity knowledge base in the multimedia program on demand field; step 1 may comprise the following sub-steps: and determining related entities of the multimedia program to form the entity knowledge base.
Third embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides a voice interaction apparatus. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The voice interaction apparatus provided by the present application includes:
the knowledge base construction unit is used for constructing an entity knowledge base;
the entity determining unit is used for determining entity information in the target voice data through a voice entity recognition model and the entity knowledge base;
and the interactive processing unit is used for executing voice interactive processing according to the entity information.
Fourth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing the voice interaction method, the device being powered on and the program for implementing the method being executed by the processor for performing the steps of: constructing an entity knowledge base; determining entity information in the target voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
The electronic equipment can be an intelligent sound box, an intelligent television, a food ordering machine, a vending machine, a ticket vending machine, a chat robot and the like.
Fifth embodiment
In the foregoing embodiment, a voice interaction system is provided, and correspondingly, the present application also provides a voice interaction method, where the execution subject of the method may be a smart speaker, a smart television, a vending machine, a ticket vending machine, a chat robot, and so on. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the second embodiment are not described again; please refer to the corresponding parts in the second embodiment.
The voice interaction method provided by the application can comprise the following steps: collecting voice data, and sending the voice data to a server so that the server constructs an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
Sixth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides a voice interaction apparatus. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The voice interaction apparatus provided by the present application includes:
the voice data acquisition unit is used for acquiring voice data;
the voice data sending unit is used for sending the voice data to the server so that the server constructs an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
Seventh embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing the voice interaction method, the device being powered on and the program for implementing the method being executed by the processor for performing the steps of: collecting voice data, and sending the voice data to a server so that the server constructs an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
Eighth embodiment
In the foregoing embodiment, a multimedia program on demand system is provided, and correspondingly, the present application further provides a voice interaction system. The system corresponds to the embodiments of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The application provides a voice interaction system, including: terminal equipment and server.
The terminal equipment is used for acquiring voice data and sending the voice data to the server side; the server is used for constructing an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
Ninth embodiment
In the foregoing embodiment, a multimedia program on demand system is provided, and correspondingly, the present application also provides a multimedia program on demand method, where the execution subject of the method may be an intelligent sound box or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts in the first embodiment.
The multimedia program on demand method provided by the application can comprise the following steps: collecting multimedia program on-demand voice data, and sending the voice data to a server side so that the server side can construct a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
Tenth embodiment
In the above embodiment, a multimedia program on demand method is provided, and correspondingly, the application also provides a multimedia program on demand device. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The multimedia program on demand apparatus provided by the present application includes:
the voice data acquisition unit is used for acquiring multimedia program on-demand voice data;
the voice data sending unit is used for sending the voice data to the server so that the server can construct a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
Eleventh embodiment
The application also provides an intelligent sound box. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An intelligent sound box of this embodiment includes: a processor and a memory; a memory for storing a program for implementing the multimedia program on demand method, the device being powered on and running the program of the method via the processor for performing the following steps: collecting multimedia program on-demand voice data, and sending the voice data to a server side so that the server side can construct a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
Twelfth embodiment
In the foregoing embodiment, a multimedia program on demand system is provided, and correspondingly, the present application also provides a multimedia program on demand method, where an execution subject of the method may be a server or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The multimedia program on demand method provided by the application can comprise the following steps:
step 1: constructing a multimedia program knowledge base;
step 2: determining multimedia program information in the multimedia program on demand voice data through a voice entity recognition model and the knowledge base;
and step 3: and executing multimedia program playing processing according to the multimedia program information.
Thirteenth embodiment
In the above embodiment, a multimedia program on demand method is provided, and correspondingly, the application also provides a multimedia program on demand device. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The multimedia program on demand apparatus provided by the present application includes:
the knowledge base construction unit is used for constructing a multimedia program knowledge base;
the entity recognition unit is used for determining the multimedia program information in the multimedia program on-demand voice data through the voice entity recognition model and the knowledge base;
and the program playing processing unit is used for executing the multimedia program playing processing according to the multimedia program information.
Fourteenth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing the method for requesting a multimedia program, the device being powered on and running the program for the method via the processor for performing the following steps: constructing a multimedia program knowledge base; determining multimedia program information in the multimedia program on demand voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
Fifteenth embodiment
In the foregoing embodiment, a multimedia program on demand system is provided, and correspondingly, the present application further provides a speech entity recognition model construction method, where an execution subject of the method may be a server or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The method for constructing the speech entity recognition model provided by the application can comprise the following steps:
step 1: determining a training data set, the training data comprising: voice data, entity tagging information and an entity knowledge base;
step 2: constructing a network structure of the model;
and step 3: the model is learned from a training data set.
In one example, the model comprises an audio coding model for determining audio characteristic data of the speech data; the model comprises an entity decoding model, and entity information in the voice data is determined according to the audio characteristic data and the entity knowledge base.
In another example, the model comprises an audio coding model for determining audio characteristic data of the speech data; the model comprises an entity coding model used for determining pronunciation characteristic data of the entity in the entity knowledge base; the model comprises an entity decoding model used for determining entity information in the voice data according to the audio characteristic data and the entity pronunciation characteristic data.
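The dual-encoder structure described in this example can be sketched as follows. The linear "encoders" are toy stand-ins for the LSTM or Transformer encoders mentioned earlier, and all dimensions and feature shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class Encoder:
    """Toy stand-in for an audio or entity encoding model: mean-pool
    the input sequence, then apply a learned linear projection."""
    def __init__(self, in_dim, out_dim):
        self.weights = rng.normal(size=(in_dim, out_dim))

    def __call__(self, sequence):
        return sequence.mean(axis=0) @ self.weights

audio_encoder = Encoder(in_dim=40, out_dim=8)    # consumes audio frame features
entity_encoder = Encoder(in_dim=26, out_dim=8)   # consumes entity name features

audio_frames = rng.normal(size=(100, 40))   # 100 frames of 40-dim audio features
entity_chars = rng.normal(size=(9, 26))     # 9 characters of an entity name

audio_vec = audio_encoder(audio_frames)
entity_vec = entity_encoder(entity_chars)
# Training would adjust both encoders so that audio_vec moves closer to
# the feature vector of the entity the audio contains, and the entity
# decoding model compares them with a distance function.
print(audio_vec.shape, entity_vec.shape)  # (8,) (8,)
```

Both encoders project into a shared 8-dimensional space here, which is what allows the entity decoding model to compare audio and entity pronunciation vectors directly.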
As can be seen from the foregoing embodiments, the speech entity recognition model construction method provided in the embodiments of the present application determines a training data set (the training data including voice data, entity annotation information, and an entity knowledge base), constructs the network structure of the model, and learns the model from the training data set. This processing introduces entity knowledge graph information and directly checks whether the pronunciation of an entity in the knowledge graph occurs in the speech, realizing semantic understanding and entity recognition directly from the speech signal. It does not depend on ASR to first convert the speech into text and then recognize the entity name through semantic understanding of that text, and is thus closer to the way humans understand speech. Therefore, the accuracy of the speech entity recognition model can be effectively improved.
Sixteenth embodiment
In the foregoing embodiment, a speech entity recognition model construction method is provided, and correspondingly, the present application further provides a speech entity recognition model construction apparatus. The apparatus corresponds to the embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts in the first embodiment.
The speech entity recognition model construction apparatus provided by the present application includes:
a training data determination unit for determining a training data set, the training data comprising: voice data, entity tagging information and an entity knowledge base;
the network construction unit is used for constructing a network structure of the model;
and the model training unit is used for learning the model from the training data set.
Seventeenth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing a speech entity recognition model building method, the apparatus performing the following steps after being powered on and running the program of the method through the processor: determining a training data set, the training data comprising: voice data, entity tagging information and an entity knowledge base; constructing a network structure of the model; the model is learned from a training data set.
Eighteenth embodiment
In the foregoing embodiment, a multimedia program on demand system is provided, and correspondingly, the application further provides a method for constructing an entity knowledge base, where an execution subject of the method may be a server or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The entity knowledge base construction method provided by the application can comprise the following steps:
step 1: acquiring an entity name of a target field;
step 2: and generating an entity knowledge base of the target field according to the entity name, wherein the entity knowledge base is used for determining entity information in the voice data of the target field through a voice entity recognition model and the entity knowledge base.
The target fields include, but are not limited to: the multimedia program on demand field, the communication field, and the like.
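A minimal sketch of the two construction steps, assuming a dictionary-shaped knowledge base; the field names and the placeholder pronunciation step are illustrative assumptions:

```python
def to_pronunciation(name):
    # Placeholder grapheme-to-pronunciation step; a real system would use
    # a pronunciation lexicon or a G2P model here.
    return name.lower()

def build_knowledge_base(domain, entity_names):
    """Generate a per-domain entity knowledge base keyed by entity name.

    Each entry holds the fields later consumed by the speech entity
    recognition model, e.g. a pronunciation to compare against audio."""
    return {
        "domain": domain,
        "entities": {
            name: {"pronunciation": to_pronunciation(name)}
            for name in entity_names
        },
    }

kb = build_knowledge_base("multimedia program on demand", ["Memorial", "Commemoration"])
print(kb["entities"]["Memorial"]["pronunciation"])  # "memorial"
```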
As can be seen from the above embodiments, the entity knowledge base construction method provided in the embodiments of the present application acquires the entity names of a target field and generates the entity knowledge base of the target field according to those names; the entity knowledge base is used, together with a speech entity recognition model, for determining the entity information in the voice data of the target field. This processing constructs entity knowledge graph information and lays a data foundation for building a speech entity recognition model: the entity knowledge graph information can be introduced into the model, which directly checks whether the pronunciation of an entity in the knowledge graph occurs in the speech, realizing semantic understanding and entity recognition directly from the speech signal. It does not depend on ASR to first convert the speech into text and then recognize the entity name through semantic understanding of that text, and is thus closer to the way humans understand speech. Therefore, the accuracy of the speech entity recognition model can be effectively improved.
Nineteenth embodiment
In the foregoing embodiment, an entity knowledge base construction method is provided, and correspondingly, the present application further provides an entity knowledge base construction apparatus. The apparatus corresponds to the embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts in the first embodiment.
The entity knowledge base construction apparatus provided by the present application includes:
the entity determining unit is used for acquiring an entity name of the target field;
and the knowledge base generation unit is used for generating an entity knowledge base of the target field according to the entity name, where the entity knowledge base is used for determining the entity information in the voice data of the target field through a speech entity recognition model and the entity knowledge base.
Twentieth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing the entity knowledge base construction method, the device executing the following steps after being powered on and running the program of the method through the processor: acquiring an entity name of a target field; and generating an entity knowledge base of the target field according to the entity name, wherein the entity knowledge base is used for determining entity information in the voice data of the target field through a voice entity recognition model and the entity knowledge base.
Twenty-first embodiment
In the above embodiment, a multimedia program on demand system is provided, and correspondingly, the present application also provides a speech entity recognition method, where an execution subject of the method may be a server or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The speech entity recognition method provided by the application can comprise the following steps:
step 1: constructing an entity knowledge base and a voice entity recognition model;
step 2: determining target voice data;
Step 3: determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base.
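The three steps above can be sketched as follows. The scoring function, which in practice would be the output of the voice entity recognition model operating on the speech signal, is stubbed out here with a trivial substring match over a phoneme transcript; this placeholder and the example data are assumptions for illustration only.

```python
def recognize_entities(speech_phonemes, kb, threshold=1.0):
    """Step 3: for each knowledge-base entity, score whether its
    pronunciation occurs in the utterance and keep the matches."""
    hits = []
    for name, entry in kb.items():
        # Placeholder score: 1.0 if the entity's phoneme string occurs
        # verbatim in the utterance transcript, else 0.0. A real model
        # would score acoustic similarity on the speech signal itself.
        score = 1.0 if entry["phonemes"] in speech_phonemes else 0.0
        if score >= threshold:
            hits.append((name, score))
    return hits

kb = {"Nezha": {"phonemes": "n e2 zh a1"}}       # step 1: knowledge base
speech = "w o3 x iang3 k an4 n e2 zh a1"         # step 2: target speech
print(recognize_entities(speech, kb))            # [('Nezha', 1.0)]
```

The point of the design is visible even in this toy: recognition matches pronunciations from the knowledge base against the utterance directly, rather than transcribing the utterance first.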
As can be seen from the foregoing embodiments, the speech entity recognition method provided in the embodiments of the present application constructs an entity knowledge base and a voice entity recognition model, determines target voice data, and determines entity information in the target voice data through the voice entity recognition model and the entity knowledge base. This processing mode constructs entity knowledge graph information and lays a data foundation for the voice entity recognition model. The knowledge graph information can be introduced into the model, which then directly compares whether the pronunciations of entities in the knowledge graph occur in the speech, achieving semantic understanding and entity recognition from the speech signal itself. It does not rely on ASR to first convert the speech into text and then recognize entity names through semantic understanding of that text, and is therefore closer to the way humans understand speech; consequently, the accuracy of speech entity recognition can be effectively improved.
Twenty-second embodiment
In the foregoing embodiment, a method for recognizing a speech entity is provided, and correspondingly, an apparatus for recognizing a speech entity is also provided. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The speech entity recognition apparatus provided by the application comprises:
the model building unit is used for constructing a voice entity recognition model;
the knowledge base construction unit is used for constructing an entity knowledge base;
a voice data determination unit for determining target voice data;
and the entity determining unit is used for determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base.
Twenty-third embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing a speech entity recognition method, the device performing the following steps after being powered on and running the program of the method by the processor: constructing an entity knowledge base and a voice entity recognition model; determining target voice data; and determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base.
Twenty-fourth embodiment
In the above embodiment, a multimedia program on demand system is provided, and correspondingly, the application further provides a food ordering system. The system corresponds to the embodiments of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The ordering system provided by the application comprises: an ordering device and a server.
The ordering device is used for collecting ordering voice data and sending the voice data to the server; the server is used for constructing a food knowledge base; determining food information in the voice data through a voice entity recognition model and the knowledge base; and executing food preparation processing according to the food information.
The food knowledge base may include various names of dishes, such as New Orleans hamburger, French fries, latte coffee, etc.
The meal preparation processing can be implemented by sending the ordering information to the kitchen, sending the ordering information to the front-desk staff, and the like.
For example, a user orders by voice through the ordering device, sending the voice instruction "a New Orleans hamburger and a latte". If the user's pronunciation is slurred, a traditional speech recognition algorithm may fail to transcribe the text accurately, may recognize the names of other meal items, or may be unable to determine which item the user ordered. With the system provided in the embodiment of the present application, the food information can be accurately determined directly from the user's ordering voice instruction through the voice entity recognition model and the food knowledge base.
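The server-side flow of the ordering system can be sketched as follows. The recognizer and the kitchen/front-desk dispatch functions are assumed stand-ins for the voice entity recognition model and the meal preparation processing described above, not part of the disclosure.

```python
def handle_order(audio, food_kb, recognize, dispatchers):
    """Determine food information in the ordering voice data, then
    execute meal preparation processing for each recognized item."""
    items = recognize(audio, food_kb)
    for dispatch in dispatchers:   # e.g. send to kitchen, notify counter
        dispatch(items)
    return items

# Assumed toy recognizer: matches knowledge-base names in a transcript.
food_kb = ["New Orleans hamburger", "latte"]
recognize = lambda audio, kb: [n for n in kb if n in audio]

sent = []
dispatchers = [sent.append]        # stand-in for kitchen dispatch
order = handle_order("one New Orleans hamburger and a latte", food_kb,
                     recognize, dispatchers)
print(order)  # ['New Orleans hamburger', 'latte']
```

In the real system, `recognize` would operate on the audio signal rather than a text transcript, which is precisely what distinguishes this design from an ASR-first pipeline.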
As can be seen from the above embodiments, in the ordering system provided in the embodiments of the present application, the ordering device collects ordering voice data and sends the voice data to a server; the server constructs a food knowledge base, determines food information in the voice data through a voice entity recognition model and the knowledge base, and executes meal preparation processing according to the food information. This processing mode introduces food knowledge graph information and directly compares whether the pronunciations of food names in the knowledge graph occur in the ordering voice, achieving semantic understanding and food name recognition from the ordering voice signal itself. It does not rely on ASR to first convert the speech into text and then recognize food names through semantic understanding of that text, and is therefore closer to the way humans understand speech; consequently, the accuracy of food name recognition can be effectively improved, raising the success rate and accuracy of ordering.
Twenty-fifth embodiment
In the foregoing embodiments, an ordering system is provided, and correspondingly, the present application also provides an ordering method, where an execution subject of the method may be an ordering device or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The ordering method provided by the application can comprise the following steps: collecting ordering voice data and sending the voice data to a server, so that the server constructs a food knowledge base, determines food information in the voice data through a voice entity recognition model and the knowledge base, and executes meal preparation processing according to the food information.
Twenty-sixth embodiment
In the above embodiment, an ordering method is provided, and correspondingly, an ordering device is also provided in the present application. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The ordering apparatus provided by the application comprises:
the voice data acquisition unit is used for acquiring ordering voice data;
the voice data sending unit is used for sending the voice data to the server, so that the server constructs a food knowledge base, determines food information in the voice data through a voice entity recognition model and the knowledge base, and executes meal preparation processing according to the food information.
Twenty-seventh embodiment
The application also provides ordering equipment. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An ordering device of this embodiment comprises: a processor and a memory; the memory is used for storing a program implementing the ordering method; after being powered on and running the program of the method through the processor, the device executes the following steps: collecting ordering voice data and sending the voice data to a server, so that the server constructs a food knowledge base, determines food information in the voice data through a voice entity recognition model and the knowledge base, and executes meal preparation processing according to the food information.
Twenty-eighth embodiment
In the foregoing embodiment, an ordering system is provided, and correspondingly, the present application also provides an ordering method, where an execution subject of the method may be a server or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The ordering method provided by the application can comprise the following steps:
step 1: constructing a food knowledge base;
step 2: determining food information in the food ordering voice data through a voice entity recognition model and the entity knowledge base;
Step 3: executing meal preparation processing according to the food information.
Twenty-ninth embodiment
In the above embodiment, an ordering method is provided, and correspondingly, an ordering device is also provided in the present application. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The ordering apparatus provided by the application comprises:
the knowledge base construction unit is used for constructing a food knowledge base;
the entity recognition unit is used for determining food information in the food ordering voice data through a voice entity recognition model and the entity knowledge base;
and the meal preparation processing unit is used for executing meal preparation processing according to the meal information.
Thirtieth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing the ordering method, wherein the device executes the following steps after being powered on and running the program of the method through the processor: constructing a food knowledge base; determining food information in the food ordering voice data through a voice entity recognition model and the entity knowledge base; and executing food preparation processing according to the food information.
Thirty-first embodiment
In the foregoing embodiment, a multimedia program on demand system is provided, and correspondingly, a communication connection establishing system is also provided in the present application. The system corresponds to the embodiments of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The communication connection establishment system provided by the application comprises: user equipment and a server.
The user equipment is used for acquiring communication instruction voice data and sending the voice data to the server; the server is used for constructing a communication user knowledge base; determining communication user information in the voice data through a voice entity recognition model and the knowledge base; and executing communication connection establishment processing according to the communication user information.
The communication user knowledge base can comprise user entity information such as contact names and the like.
The user equipment can be mobile communication equipment such as a mobile phone, a smart phone and a smart sound box.
The communication connection establishment process may be a process of dialing a telephone number of a communication user, or the like.
For example, a user says the voice instruction "call Catalpa" to a smartphone in order to phone a certain contact. If the user's pronunciation is slurred, a traditional speech recognition algorithm may fail to recognize the exact name and may recognize another name instead. With the system provided in the embodiment of the present application, the contact name can be accurately determined directly from the user's call voice instruction through the voice entity recognition model and the communication user knowledge base.
In one example, the communication user knowledge base may include two contact names with identical pronunciation, rendered here as "Catalpa" and "Sub-luxury". If user A wants to call "Catalpa", the correspondence between user A and "Catalpa" can be stored in the communication user knowledge base, so that the utterance is not recognized as "Sub-luxury"; the communication accuracy can thereby be effectively improved.
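The personalization in this example can be sketched as follows: among homophonous candidate contacts, the one linked to the calling user in the knowledge base is preferred. The relation store is a hypothetical simplification of the knowledge base's user-contact entity relations.

```python
def resolve_contact(user_id, candidates, relations):
    """Among contact names with identical pronunciation, prefer the one
    that the knowledge base links to the calling user."""
    for name in candidates:
        if (user_id, name) in relations:
            return name
    return candidates[0]  # no stored relation: fall back to the first

# User A's contact list links user A to "Catalpa"; both candidate names
# sound identical, so the stored relation breaks the tie.
relations = {("user_a", "Catalpa")}
print(resolve_contact("user_a", ["Sub-luxury", "Catalpa"], relations))
# Catalpa
```

The same pattern applies to the homophonous program names in the television embodiment: entity relations between users and entities resolve ambiguity that pronunciation alone cannot.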
As can be seen from the foregoing embodiments, in the communication connection establishing system provided in the embodiments of the present application, the user equipment collects communication instruction voice data and sends the voice data to a server; the server constructs a communication user knowledge base, determines communication user information in the voice data through a voice entity recognition model and the knowledge base, and executes communication connection establishment processing according to the communication user information. This processing mode introduces contact knowledge graph information and directly compares whether the pronunciations of contact names in the knowledge graph occur in the call instruction voice, achieving semantic understanding and name recognition from the call instruction voice signal itself. It does not rely on ASR to first convert the speech into text and then recognize names through semantic understanding of that text, and is therefore closer to the way humans understand speech; consequently, the accuracy of name recognition can be effectively improved, raising the success rate and accuracy of communication.
Thirty-second embodiment
In the foregoing embodiments, a communication connection establishing system is provided, and correspondingly, the present application also provides a communication connection establishing method, where an execution subject of the method may be a user equipment or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The communication connection establishing method provided by the application can comprise the following steps: collecting communication instruction voice data and sending the voice data to a server, so that the server constructs a communication user knowledge base, determines communication user information in the voice data through a voice entity recognition model and the knowledge base, and executes communication connection establishment processing according to the communication user information.
Thirty-third embodiment
In the foregoing embodiment, a communication connection establishing method is provided, and correspondingly, a communication connection establishing apparatus is also provided. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The communication connection establishment apparatus provided by the application comprises:
the voice data acquisition unit is used for acquiring communication instruction voice data;
the voice data sending unit is used for sending the voice data to the server, so that the server constructs a communication user knowledge base, determines communication user information in the voice data through a voice entity recognition model and the knowledge base, and executes communication connection establishment processing according to the communication user information.
Thirty-fourth embodiment
The application also provides user equipment. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
A user equipment of this embodiment comprises: a processor and a memory; the memory is used for storing a program implementing the communication connection establishment method; after being powered on and running the program of the method through the processor, the device executes the following steps: collecting communication instruction voice data and sending the voice data to a server, so that the server constructs a communication user knowledge base, determines communication user information in the voice data through a voice entity recognition model and the knowledge base, and executes communication connection establishment processing according to the communication user information.
Thirty-fifth embodiment
In the foregoing embodiments, a communication connection establishing system is provided, and correspondingly, the present application also provides a communication connection establishing method, where an execution subject of the method may be a server or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The communication connection establishing method provided by the application can comprise the following steps:
step 1: constructing a communication user knowledge base;
step 2: determining communication user information in communication instruction voice data through a voice entity recognition model and the knowledge base;
Step 3: executing communication connection establishment processing according to the communication user information.
Thirty-sixth embodiment
In the foregoing embodiment, a communication connection establishing method is provided, and correspondingly, a communication connection establishing apparatus is also provided. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The communication connection establishment apparatus provided by the application comprises:
the knowledge base construction unit is used for constructing a communication user knowledge base;
the entity recognition unit is used for determining communication user information in the communication instruction voice data through a voice entity recognition model and the knowledge base;
and the communication connection processing unit is used for executing communication connection establishment processing according to the communication user information.
Thirty-seventh embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program implementing the communication connection establishment method; after being powered on and running the program of the method through the processor, the device executes the following steps: constructing a communication user knowledge base; determining communication user information in communication instruction voice data through a voice entity recognition model and the knowledge base; and executing communication connection establishment processing according to the communication user information.
Thirty-eighth embodiment
In the foregoing embodiment, a multimedia program on demand system is provided, and correspondingly, the present application further provides a television program playing system. The system corresponds to the embodiments of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The television program playing system provided by the application comprises: a smart television and a server.
The intelligent television is used for acquiring program playing voice instruction data of a user, sending the voice instruction data to the server and playing a target program object; the server is used for constructing a television program knowledge base; determining a target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base; and executing target program object playing processing according to the target program name.
The television program knowledge base comprises program related entities, such as television program names, actor names, channel names and other entity vocabularies related to television program playing.
In this embodiment, the smart television may collect the user's program playing voice instruction data through a device such as a remote controller. For example, the user issues the voice instruction "I want to watch Nezha", but the pronunciation is slurred and actually sounds like "I want to watch that station"; the server can nevertheless identify the program name "Nezha" through a voice entity recognition model and the knowledge base. For a specific implementation of the speech entity recognition in this step, reference may be made to the relevant description of the first embodiment, and details are not repeated here. After the server determines the target program name, it may execute target program object playing processing, such as sending the program's video stream to the requesting device and playing the target program object through the terminal device.
In one example, the knowledge base may include program-related entities whose names are homophones, such as movie names with the same pronunciation but different written forms. Correspondingly, the knowledge base also includes user entities and the entity relations between these homophonous program-related entities and the user entities, that is, the relations between users and the television programs they watch. Accordingly, using the related processing method of the first embodiment, the server can identify from these correspondences the movie name the user really wants to watch, rather than another movie name with the same pronunciation, so that the accuracy of the program name can be effectively improved. For a specific implementation of the speech entity recognition in this step, reference may be made to the relevant description of the first embodiment, and details are not repeated here.
In an example, the server executes the target program object playing processing according to the target program name, and may adopt the following processing manner: firstly, according to a program list of each television channel (such as a program list of the last week, which can include program information played in the last week and program information currently being played), determining a television channel and playing time related to the target program name; then, determining a target program object according to the playing time and the television channel; and finally, playing the target program object.
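The guide lookup described above can be sketched as follows. The guide entries and their field names are illustrative assumptions; a real program guide would carry richer metadata.

```python
def find_program_objects(target_name, program_guide):
    """From the past week's listings, return the channel, air time and
    program object for every entry matching the recognized name."""
    return [(e["channel"], e["time"], e["object_id"])
            for e in program_guide if e["name"] == target_name]

# Hypothetical one-week replay guide.
guide = [
    {"name": "Nezha", "channel": "CH1", "time": "June 1", "object_id": 101},
    {"name": "Nezha", "channel": "CH3", "time": "June 3", "object_id": 102},
]
print(find_program_objects("Nezha", guide))
# [('CH1', 'June 1', 101), ('CH3', 'June 3', 102)]
```

When the lookup returns several objects, the selection step described next lets the user choose among them on the television screen.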
For example, with cable television in a certain area, the user says into the remote controller "I want to watch that station"; the remote controller first recognizes that the program the user wants to play or replay is named "Nezha", then searches the one-week replay program guide for the channel and time at which "Nezha" was broadcast, and, if found, plays the replayable program object or the program object currently being broadcast on the relevant channel. The following table shows a program guide used in this embodiment.
Program name    Version                                             Air date
Nezha           Film version                                        June 1
Nezha           Animated version (Shanghai Animation Film Studio)   June 3
Nezha           52-episode version                                  June 5
As shown in the above table, the program objects corresponding to the identified target program name may include several versions, such as the film version of "Nezha" broadcast on June 1, the animated version of "Nezha" produced by the Shanghai Animation Film Studio broadcast on June 3, and the 52-episode version of "Nezha" broadcast on June 5. In this case, the server may send to the smart television the multiple program objects broadcast at multiple times on at least one television channel corresponding to the target program name, and the smart television displays them; the user can then specify a target program object through the smart television, and the server sends the video stream of the specified program object to the smart television for playback upon request. With this processing mode, all related program objects can be displayed on the television screen for the user to select, after which the user-specified target program object is played.
In specific implementation, if the server detects that the program guide does not include the target program name, it determines program names related to the target program name, displays the related program names, and, if the user specifies a related program object to play, plays that program object. With this processing mode, if the program the user originally wanted is not found, related television programs can be recommended to the user. For example, if the user wants to watch the documentary "Impression of the West Lake" but it was not broadcast in the past week, other programs related to the West Lake can be played, such as "Decryption of the Temple", "Ten Views of the West Lake", "The Scene Boat", and the like.
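The fallback in this paragraph can be sketched as follows. The relatedness test is an assumed placeholder (here, sharing a keyword); a real system might instead follow entity relations in the television program knowledge graph.

```python
def recommend(target_name, program_guide, related):
    """If the guide lacks the requested program, return related names;
    otherwise return [] and let normal playback proceed."""
    if any(e["name"] == target_name for e in program_guide):
        return []
    return [e["name"] for e in program_guide
            if related(target_name, e["name"])]

# Hypothetical guide and keyword-based relatedness test.
guide = [{"name": "Ten Views of the West Lake"},
         {"name": "Nezha"}]
related = lambda a, b: "West Lake" in a and "West Lake" in b
print(recommend("Impression of the West Lake", guide, related))
# ['Ten Views of the West Lake']
```

An empty result signals that the requested program was found and the ordinary playback path should run instead.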
As can be seen from the foregoing embodiments, in the television program playing system provided in the embodiments of the present application, the smart television collects the user's program playing voice instruction data, sends the voice instruction data to a server, and plays the target program object; the server constructs a television program knowledge base, determines the target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base, and executes target program object playing processing according to the target program name. This processing mode introduces program-related entity knowledge graph information and directly compares whether the pronunciations of program-related entities in the knowledge graph occur in the program playing voice instruction, achieving semantic understanding and recognition of entities such as program names from the voice instruction signal itself. It does not rely on ASR to first convert the speech into text and then recognize program names through semantic understanding of that text, and is therefore closer to the way humans understand speech; consequently, the accuracy of program entity recognition can be effectively improved, raising the success rate and accuracy of video program on-demand.
Thirty-ninth embodiment
In the foregoing embodiment, a television program playing system is provided, and correspondingly, the application further provides a television program playing method, where an execution main body of the method may be an intelligent television, a television remote controller, and the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The television program playing method provided by the application can comprise the following steps:
step 1: collecting program playing voice instruction data of a user;
Step 2: sending the voice instruction data to a server, so that the server constructs a television program knowledge base, determines the target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base, and executes target program object playing processing according to the target program name;
Step 3: playing the target program object.
Fortieth embodiment
In the foregoing embodiment, a method for playing a television program is provided, and correspondingly, a device for playing a television program is also provided. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The television program playing apparatus provided by the application comprises:
the voice data acquisition unit is used for acquiring program playing voice instruction data of a user;
the voice data sending unit is used for sending the voice instruction data to the server, so that the server constructs a television program knowledge base, determines the target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base, and executes target program object playing processing according to the target program name;
and the program playing unit is used for playing the target program object.
Forty-first embodiment
The application also provides an intelligent television. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The smart television of this embodiment includes: a processor and a memory; the memory is used for storing a program for implementing the television program playing method; after the device is powered on and the program of the method is run by the processor, the following steps are executed: collecting program playing voice instruction data of a user; sending the voice instruction data to a server, so that the server constructs a television program knowledge base, determines a target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base, and executes target program object playing processing according to the target program name; and playing the target program object.
Forty-second embodiment
The application also provides a remote controller. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The remote controller of this embodiment includes: a processor and a memory; the memory is used for storing a program for implementing the television program playing method; after the device is powered on and the program of the method is run by the processor, the following steps are executed: collecting program playing voice instruction data of a user; and sending the voice instruction data to a server, so that the server constructs a television program knowledge base, determines a target program name corresponding to the voice instruction data through a voice entity recognition model and the knowledge base, and executes target program object playing processing according to the target program name.
Forty-third embodiment
In the foregoing embodiment, a television program playing system is provided; correspondingly, the application further provides a television program playing method, where the execution subject of the method may be a server, a smart television, or the like. The method corresponds to the system embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The television program playing method provided by the application can comprise the following steps:
step 1: building a television program knowledge base;
step 2: determining a target program name corresponding to the target program playing voice instruction data through a voice entity recognition model and the knowledge base;
and step 3: and executing target program object playing processing according to the target program name.
In one example, the knowledge base includes, but is not limited to: program-related entities, user entities, and entity relationships among homophonic user entities (user entities whose names share the same pronunciation but use different characters). Step 1 can be implemented in the following way: determining the user entities according to the historical playing information of the user, and constructing the entity relationships.
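As an illustrative sketch of this construction step, the following Python fragment groups user entities drawn from historical playing records by identical pronunciation. The pinyin table is a toy stand-in for a real grapheme-to-pronunciation module, and the names are made up:

```python
# Toy pronunciation table; a real system would use a grapheme-to-phoneme
# module. The names and pinyin below are illustrative assumptions.
PINYIN = {
    "张三": "zhang1 san1",
    "章三": "zhang1 san1",  # same pronunciation, different characters
    "李四": "li3 si4",
}

def build_homophone_relations(play_history):
    """Group user entities from play-history records by pronunciation."""
    by_pron = {}
    for record in play_history:
        name = record["user"]
        by_pron.setdefault(PINYIN[name], set()).add(name)
    # Keep only pronunciations shared by several distinct names: these are
    # the "same pronunciation, different characters" entity relationships.
    return {p: names for p, names in by_pron.items() if len(names) > 1}

history = [{"user": "张三"}, {"user": "章三"}, {"user": "李四"}]
relations = build_homophone_relations(history)
```

A recognizer can then use such a relationship, together with other user information, to pick the intended user entity among homophones.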
In one example, step 3 may include the following sub-steps: 3.1) determining a television channel and playing time corresponding to the target program name according to a program list; 3.2) determining a target program object according to the playing time and the television channel; 3.3) executing the processing of playing the target program object.
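The sub-steps above can be sketched as a simple program-guide lookup; the `name`, `channel`, and `time` fields of the program list are assumed here for illustration:

```python
def find_target_program(program_list, target_name):
    """Sub-steps 3.1-3.2: find the television channel and playing time for
    the target program name, then return the resolved program object."""
    for entry in program_list:
        if entry["name"] == target_name:
            return {"name": target_name,
                    "channel": entry["channel"],
                    "time": entry["time"]}
    return None  # not in the guide

guide = [
    {"name": "Evening News", "channel": "CCTV-1", "time": "19:00"},
    {"name": "Nature Documentary", "channel": "CCTV-9", "time": "20:00"},
]
target = find_target_program(guide, "Nature Documentary")
```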
In specific implementation, if the execution subject of the method is a server, the video stream of the target program object can be sent to the smart television for playing; if the execution subject of the method is the smart television, the target program object can be played directly.
In one example, step 3.2 may comprise the sub-steps of: 3.2.1) displaying a plurality of program objects played by at least one television channel corresponding to the target program name at a plurality of times; 3.2.2) taking the program object specified by the user as the target program object.
In specific implementation, if the execution subject of the method is a server, the plurality of program objects played by the at least one television channel corresponding to the target program name at the plurality of times can be sent to the smart television for display; if the execution subject of the method is the smart television, these program objects can be displayed directly.
In one example, step 3.2 may comprise the sub-steps of: 3.2.3) if the program list does not comprise the target program name, determining the program name related to the target program name; 3.2.4) displaying the related program name; 3.2.5) if the user designates the playback of the related program object, the processing for playing back the related program object is executed.
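A minimal sketch of sub-step 3.2.3, using `difflib` string similarity as a stand-in for whatever relatedness measure an implementation actually uses:

```python
import difflib

def related_program_names(program_list, target_name, cutoff=0.6):
    """When the guide lacks the exact name, suggest the most similarly
    named programs instead (up to three)."""
    names = [entry["name"] for entry in program_list]
    if target_name in names:
        return [target_name]
    return difflib.get_close_matches(target_name, names, n=3, cutoff=cutoff)

guide = [{"name": "Evening News"}, {"name": "Cooking Show"}]
suggestions = related_program_names(guide, "Evening New")
```

The suggested names would then be displayed (sub-step 3.2.4) for the user to pick from.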
In specific implementation, if the execution subject of the method is a server, the related program names can be sent to the smart television for display; if the execution subject of the method is the smart television, the related program names can be displayed directly.
Forty-fourth embodiment
In the foregoing embodiment, a television program playing method is provided; correspondingly, a television program playing apparatus is also provided. The apparatus corresponds to the method embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The television program playing apparatus provided by the application includes:
the knowledge base construction unit is used for constructing a television program knowledge base;
the entity recognition unit is used for determining a target program name corresponding to the target program playing voice instruction data through a voice entity recognition model and the knowledge base;
and the playing processing unit is used for executing the playing processing of the target program object according to the target program name.
Forty-fifth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The electronic device of this embodiment includes: a processor and a memory; the memory is used for storing a program for implementing the television program playing method; after the device is powered on and the program of the method is run by the processor, the following steps are executed: building a television program knowledge base; determining a target program name corresponding to the target program playing voice instruction data through a voice entity recognition model and the knowledge base; and executing target program object playing processing according to the target program name.
Forty-sixth embodiment
In the foregoing embodiment, a multimedia program on demand system is provided; correspondingly, the present application further provides a conference recording system. The system corresponds to the system embodiments described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The conference recording system provided by the application includes: a terminal device and a server.
The terminal device is used for collecting voice data of a target conference and sending the voice data to the server. The server is used for constructing a language knowledge base of the conference field; determining entity information in the voice data of the target conference through a voice entity recognition model and the language knowledge base of the conference field; and determining a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
The language knowledge base of the conference field includes, but is not limited to, technical terms of the corresponding field. That is, the entity information in the target conference voice data may be technical terms of the corresponding field.
The conference field can be any of various application fields, such as the computer field, the medical field, the legal field, or the patent field. In specific implementation, language knowledge bases for a plurality of conference fields can be constructed, such as a language knowledge base for the computer field, a language knowledge base for the medical field, a language knowledge base for the legal field, and a language knowledge base for the patent field.
In specific implementation, for a given conference field, the language knowledge of the field can be determined according to various text data and multimedia data of the field, forming the corresponding language knowledge base.
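A deliberately naive sketch of harvesting candidate domain terms from a field's text data (frequent word n-grams; a production system would use proper terminology extraction over the field's texts and multimedia transcripts):

```python
import collections

def build_domain_term_base(documents, min_count=2, max_n=3):
    """Count word n-grams across a field's documents and keep the frequent
    multi-word ones as candidate domain terms for the knowledge base."""
    counts = collections.Counter()
    for doc in documents:
        words = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep multi-word n-grams seen at least `min_count` times.
    return {term for term, c in counts.items()
            if c >= min_count and " " in term}

docs = ["gradient descent converges", "stochastic gradient descent"]
terms = build_domain_term_base(docs)
```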
Take an international conference in the computer field as an example: the conference language is English, the participants include technical personnel from multiple countries, and some participants' English pronunciation of professional vocabulary is not clear or accurate. With the system provided by this embodiment of the application, the voice data of a conference speaker can be collected by a terminal device at the conference site. The server determines the domain term vocabulary (entity information) in the collected conference voice data through a voice entity recognition model and the language knowledge base of the conference field (which includes the various terms of the field), and then determines the text sequence corresponding to the conference voice data through the voice recognition model and the recognized term vocabulary, thereby forming the conference record.
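The core matching idea in this scenario — checking whether the audio contains the pronunciation of a term from the domain knowledge base — can be sketched as follows. Here `difflib` similarity over phoneme strings stands in for the learned speech entity recognition model, and the two-entry lexicon is a made-up knowledge base:

```python
import difflib

# Assumed toy pronunciation lexicon acting as the domain knowledge base.
TERM_PRONUNCIATIONS = {
    "cache": "K AE SH",
    "kernel": "K ER N AH L",
}

def spot_domain_terms(window_phonemes, threshold=0.8):
    """Return domain terms whose stored pronunciation is sufficiently
    similar to the phoneme sequence decoded from an audio window."""
    hits = []
    for term, pron in TERM_PRONUNCIATIONS.items():
        score = difflib.SequenceMatcher(None, window_phonemes, pron).ratio()
        if score >= threshold:
            hits.append(term)
    return hits

# A window whose decoded phonemes closely match the term "cache".
hits = spot_domain_terms("K AE SH")
```

Because matching happens on pronunciations rather than on ASR text, a term can be spotted even when a speaker's accented pronunciation would confuse a text-based pipeline.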
In one example, the server is further configured to determine the conference field corresponding to the conference voice data. In specific implementation, a user may specify the conference field, for example, when starting the conference record; the conference field may also be determined automatically in other ways.
As can be seen from the foregoing embodiments, in the conference recording system provided in this embodiment of the application, the terminal device collects the voice data of the target conference and sends it to the server; the server constructs a language knowledge base of the conference field, determines the entity information in the voice data of the target conference through a voice entity recognition model and the language knowledge base, and determines the text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form the conference record. This processing mode introduces entity graph information such as the related terms of the conference field and directly checks whether the conference voice data contains the pronunciations of entities (such as domain terms) in the knowledge graph, so that semantic understanding and the recognition of domain-term entities are performed on the conference voice signal itself. It does not rely on first converting the voice into text through ASR and then recognizing the domain terms through semantic understanding of the text, and is therefore closer to the way humans understand speech. As a result, the accuracy of recognizing the related terms of the conference field can be effectively improved, improving the success rate and accuracy of the conference record.
Forty-seventh embodiment
In the foregoing embodiment, a conference recording system is provided; correspondingly, the present application also provides a conference recording method, where the execution subject of the method may be a terminal device such as a court trial all-in-one machine. The method corresponds to the system embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The conference recording method provided by the application can comprise the following steps:
step 1: collecting voice data of a target conference;
step 2: sending the voice data to a server, so that the server constructs a language knowledge base of the conference field; determines entity information in the voice data of the target conference through a voice entity recognition model and the language knowledge base of the conference field; and determines a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
Forty-eighth embodiment
In the foregoing embodiment, a conference recording method is provided; correspondingly, the present application further provides a conference recording apparatus. The apparatus corresponds to the method embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The conference recording apparatus provided by the application includes:
the voice data acquisition unit is used for acquiring voice data of the target conference;
the voice data sending unit is used for sending the voice data to the server, so that the server constructs a language knowledge base of the conference field; determines entity information in the voice data of the target conference through a voice entity recognition model and the language knowledge base of the conference field; and determines a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
Forty-ninth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The electronic device of this embodiment includes: a processor and a memory; the memory is used for storing a program for implementing the conference recording method; after the device is powered on and the program of the method is run by the processor, the following steps are executed: collecting voice data of a target conference; and sending the voice data to a server, so that the server constructs a language knowledge base of the conference field; determines entity information in the voice data of the target conference through a voice entity recognition model and the language knowledge base of the conference field; and determines a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
Fiftieth embodiment
In the foregoing embodiment, a conference recording system is provided; correspondingly, the application also provides a conference recording method, where the execution subject of the method may be a server, a court trial all-in-one machine, or the like. The method corresponds to the system embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The conference recording method provided by the application can comprise the following steps:
step 1: constructing a language knowledge base of the conference field;
step 2: determining entity information in the voice data of the target conference through a voice entity recognition model and a language knowledge base of the conference field;
and step 3: and determining a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
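One way step 3 could combine the ASR output with the entities recognized directly from audio, assuming the entity recognizer reports (start, end) token spans (a format invented here for illustration):

```python
def form_conference_record(asr_tokens, entity_spans):
    """Overwrite ASR token spans with the entities recognized directly from
    the audio, then join everything into the transcript text sequence.
    `entity_spans` maps (start, end) token ranges to entity surface forms."""
    tokens = list(asr_tokens)
    # Apply spans right-to-left so earlier indices stay valid.
    for (start, end), entity in sorted(entity_spans.items(), reverse=True):
        tokens[start:end] = [entity]
    return " ".join(tokens)

# The ASR mis-heard the domain term "cache" as "cash"; the entity
# recognizer spotted the term at token position 1.
record = form_conference_record(
    ["the", "cash", "layer", "is", "full"],
    {(1, 2): "cache"},
)
```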
In one example, the method may further comprise the steps of: and determining a conference field corresponding to the conference voice data.
Fifty-first embodiment
In the foregoing embodiment, a conference recording method is provided; correspondingly, the present application further provides a conference recording apparatus. The apparatus corresponds to the method embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The conference recording apparatus provided by the application includes:
the knowledge base construction unit is used for constructing a language knowledge base of the conference field;
the entity recognition unit is used for determining entity information in the target conference voice data through a voice entity recognition model and a language knowledge base of the conference field;
and the conference record determining unit is used for determining a text sequence corresponding to the conference voice data through the voice recognition model and the entity information to form a conference record.
Fifty-second embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The electronic device of this embodiment includes: a processor and a memory; the memory is used for storing a program for implementing the conference recording method; after the device is powered on and the program of the method is run by the processor, the following steps are executed: constructing a language knowledge base of the conference field; determining entity information in the voice data of the target conference through a voice entity recognition model and the language knowledge base of the conference field; and determining a text sequence corresponding to the conference voice data through a voice recognition model and the entity information to form a conference record.
Although the present application has been described with reference to preferred embodiments, they are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the appended claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (28)

1. A multimedia program on demand system, comprising:
the intelligent sound box is used for collecting multimedia program on-demand voice data and sending the voice data to the server; playing the multimedia program according to the multimedia program playing processing result of the server;
the server is used for constructing a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
2. An ordering system, comprising:
the ordering device is used for acquiring ordering voice data and sending the voice data to the server;
the server is used for constructing a food knowledge base; determining food information in the voice data through a voice entity recognition model and the knowledge base; and executing food preparation processing according to the food information.
3. A communication connection establishment system, comprising:
the user equipment is used for acquiring communication instruction voice data and sending the voice data to the server;
the server is used for constructing a communication user knowledge base; determining communication user information in the voice data through a voice entity recognition model and the knowledge base; and executing communication connection establishment processing according to the communication user information.
4. A voice interaction system, comprising:
the terminal equipment is used for acquiring voice data and sending the voice data to the server;
the server is used for constructing an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
5. A method of voice interaction, comprising:
constructing an entity knowledge base;
determining entity information in the target voice data through a voice entity recognition model and the entity knowledge base;
and executing voice interaction processing according to the entity information.
6. The method of claim 5,
the determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base comprises:
determining audio characteristic data of the voice data through an audio coding model included in the voice entity recognition model;
and determining the entity information according to the audio characteristic data through an entity decoding model included in the voice entity recognition model and the entity knowledge base.
7. The method of claim 6,
the determining the entity information according to the audio characteristic data through the entity decoding model and the entity knowledge base included in the speech entity recognition model includes:
determining at least one candidate pronunciation of the entity information according to the audio characteristic data through an entity candidate pronunciation determination module included in the entity decoding model;
determining, by an entity pronunciation determination module included in the entity decoding model, a pronunciation of the entity information from the at least one candidate pronunciation according to the entity knowledge base;
and determining the entity information according to the pronunciation of the entity information.
8. The method of claim 7,
the determining the pronunciation of the entity information from the at least one candidate pronunciation according to the entity knowledge base comprises:
determining similarity of the pronunciation of the entity in the entity knowledge base and the candidate pronunciation;
and determining the pronunciation of the entity information according to the similarity.
9. The method of claim 7,
the entity knowledge base comprises: a program entity knowledge base in the multimedia program on demand field;
the program entity knowledge base comprises: program-related entities, user entities, and entity relationships among homophonic user entities having the same pronunciation but different characters;
the building of the entity knowledge base comprises the following steps:
determining the user entity according to the historical playing information of the user, and constructing the entity relationship;
the determining the entity information according to the pronunciation of the entity information includes:
determining candidate entities according to the pronunciations of the entity information;
and determining the entity information from the candidate entities according to the user information and the entity relationship.
10. The method of claim 7, further comprising:
learning from training data to obtain the speech entity recognition model;
wherein the training data comprises: audio data and entity tagging information.
11. The method of claim 6,
the determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base comprises:
determining audio characteristic data of the voice data through an audio coding model included in the voice entity recognition model;
determining pronunciation characteristic data of an entity in the entity knowledge base through an entity coding model included in the voice entity recognition model;
and determining the entity information according to the audio characteristic data and the entity pronunciation characteristic data through an entity decoding model included in the voice entity recognition model.
12. The method of claim 11,
the determining the entity information according to the audio feature data and the entity pronunciation feature data through the entity decoding model included in the speech entity recognition model includes:
determining pronunciation similarity between an entity in the voice data and an entity in the entity knowledge base according to the audio characteristic data and the entity pronunciation characteristic data;
and determining the entity information according to the pronunciation similarity.
13. The method of claim 12,
the entity knowledge base comprises: a program entity knowledge base in the multimedia program on demand field;
the program entity knowledge base comprises: program-related entities, user entities, and entity relationships among homophonic user entities having the same pronunciation but different characters;
the building of the entity knowledge base comprises the following steps:
determining the user entity according to the historical playing information of the user, and constructing the entity relationship;
the determining the entity information according to the pronunciation similarity includes:
determining candidate entities according to the pronunciation similarity;
and determining the entity information from the candidate entities according to the user information and the entity relationship.
14. The method of claim 11, further comprising:
learning from training data to obtain the speech entity recognition model;
wherein the training data comprises: audio data, an entity knowledge base, and entity tagging information.
15. The method of claim 5,
the entity knowledge base comprises: a program entity knowledge base in the multimedia program on demand field;
the building of the entity knowledge base comprises the following steps:
and determining related entities of the multimedia program to form the entity knowledge base.
16. A method of voice interaction, comprising:
collecting voice data, and sending the voice data to a server so that the server constructs an entity knowledge base; determining entity information in the voice data through a voice entity recognition model and the entity knowledge base; and executing voice interaction processing according to the entity information.
17. A method for requesting a multimedia program, comprising:
constructing a multimedia program knowledge base;
determining multimedia program information in the multimedia program on demand voice data through a voice entity recognition model and the knowledge base;
and executing multimedia program playing processing according to the multimedia program information.
18. A method for requesting a multimedia program, comprising:
collecting multimedia program on-demand voice data, and sending the voice data to a server side so that the server side can construct a multimedia program knowledge base; determining multimedia program information in the voice data through a voice entity recognition model and the knowledge base; and executing multimedia program playing processing according to the multimedia program information.
19. An ordering method, comprising:
constructing a food knowledge base;
determining food information in the food ordering voice data through a voice entity recognition model and the knowledge base;
and executing food preparation processing according to the food information.
20. An ordering method, comprising:
collecting ordering voice data, and sending the voice data to a server side so that the server side can construct a food knowledge base; determining food information in the voice data through a voice entity recognition model and the knowledge base; and executing food preparation processing according to the food information.
21. A method for establishing a communication connection, comprising:
constructing a communication user knowledge base;
determining communication user information in communication instruction voice data through a voice entity recognition model and the knowledge base;
and executing communication connection establishment processing according to the communication user information.
22. A method for establishing a communication connection, comprising:
collecting communication instruction voice data, and sending the voice data to a server side so that the server side can construct a communication user knowledge base; determining communication user information in the voice data through a voice entity recognition model and the knowledge base; and executing communication connection establishment processing according to the communication user information.
23. A method for constructing a speech entity recognition model is characterized by comprising the following steps:
determining a training data set, the training data comprising: voice data, entity tagging information and an entity knowledge base;
constructing a network structure of the model;
and learning the model from the training data set.
24. The method of claim 23,
the model comprises an audio coding model for determining audio characteristic data of the voice data;
the model comprises an entity decoding model for determining entity information in the voice data according to the audio characteristic data and the entity knowledge base.
25. The method of claim 23,
the model comprises an audio coding model for determining audio characteristic data of the voice data;
the model comprises an entity coding model for determining pronunciation characteristic data of the entities in the entity knowledge base;
the model comprises an entity decoding model for determining entity information in the voice data according to the audio characteristic data and the entity pronunciation characteristic data.
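The three-component arrangement of claim 25 can be sketched as follows. This is a minimal illustration, not the patented implementation: the audio and entity encoders are fixed stand-in embeddings rather than learned networks, the simulated audio features are just the entity embedding plus noise, and the knowledge-base entries are hypothetical.

```python
import hashlib
import numpy as np

DIM = 64
rng = np.random.default_rng(0)

def entity_encoder(pronunciation: str) -> np.ndarray:
    """Entity coding model: map a pronunciation string to a unit feature vector.
    (Stand-in: a deterministic pseudo-random embedding per string.)"""
    seed = int.from_bytes(hashlib.md5(pronunciation.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

def audio_encoder(pronunciation: str) -> np.ndarray:
    """Audio coding model (simulated): features for speech of a pronunciation,
    modeled as the entity embedding plus a little acoustic noise."""
    return entity_encoder(pronunciation) + 0.05 * rng.standard_normal(DIM)

def entity_decoder(audio_features: np.ndarray, knowledge_base: dict) -> str:
    """Entity decoding model: score the audio features against each entity's
    pronunciation features and return the best-matching entity."""
    return max(knowledge_base,
               key=lambda name: float(audio_features @ entity_encoder(knowledge_base[name])))

# Hypothetical entity knowledge base: entity name -> pronunciation.
knowledge_base = {
    "mapo tofu": "ma po dou fu",
    "kung pao chicken": "gong bao ji ding",
    "hot pot": "huo guo",
}

print(entity_decoder(audio_encoder("ma po dou fu"), knowledge_base))
```

Matching in a shared feature space, rather than transcribing first, is what lets the decoder resolve entities that a plain speech recognizer would misspell.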
26. A voice entity recognition method, comprising:
constructing an entity knowledge base and a voice entity recognition model;
determining target voice data;
and determining entity information in the target voice data through the voice entity recognition model and the entity knowledge base.
27. A method for playing a television program, comprising:
building a television program knowledge base;
determining a target program name corresponding to the target program playing voice instruction data through a voice entity recognition model and the knowledge base;
and executing target program playing processing according to the target program name.
28. A conference recording method, comprising:
collecting voice data of a target conference;
sending the voice data to a server side, so that the server side constructs a language knowledge base of the conference field; determining entity information in the voice data of the target conference through a voice entity recognition model and the language knowledge base of the conference field; and determining, through a voice recognition model and the entity information, a text sequence corresponding to the conference voice data to form a conference record.
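A rough sketch of the conference-recording flow in claim 28: a generic speech recognizer yields a raw transcript, entities are recognized against a conference-domain knowledge base, and the recognized entities are folded back into the transcript to correct misrecognized terms. The knowledge-base entries, the sample transcript, and the fuzzy-matching rule are purely illustrative stand-ins for the claimed models.

```python
import difflib

# Hypothetical conference-domain knowledge base of entity names.
conference_knowledge_base = ["Kubernetes", "gRPC", "roadmap"]

def fold_in_entities(raw_transcript: str, knowledge_base, cutoff: float = 0.7) -> str:
    """Replace transcript tokens that closely match a known entity with the
    canonical entity form from the domain knowledge base."""
    canonical = {entity.lower(): entity for entity in knowledge_base}
    corrected = []
    for token in raw_transcript.split():
        match = difflib.get_close_matches(token.lower(), canonical, n=1, cutoff=cutoff)
        corrected.append(canonical[match[0]] if match else token)
    return " ".join(corrected)

raw = "deploy the kubernets cluster and expose it over grpc"
print(fold_in_entities(raw, conference_knowledge_base))
```

Here the domain knowledge base repairs "kubernets" and "grpc" to their canonical forms, which is the role the claim assigns to combining the entity information with the speech recognition output.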
CN202010628897.8A 2020-07-02 2020-07-02 Voice cross-correlation system, method, device and equipment Pending CN113889117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010628897.8A CN113889117A (en) 2020-07-02 2020-07-02 Voice cross-correlation system, method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010628897.8A CN113889117A (en) 2020-07-02 2020-07-02 Voice cross-correlation system, method, device and equipment

Publications (1)

Publication Number Publication Date
CN113889117A (en) 2022-01-04

Family

ID=79012956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010628897.8A Pending CN113889117A (en) 2020-07-02 2020-07-02 Voice cross-correlation system, method, device and equipment

Country Status (1)

Country Link
CN (1) CN113889117A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230138820A1 (en) * 2021-10-28 2023-05-04 Microsoft Technology Licensing, Llc Real-time name mispronunciation detection


Similar Documents

Publication Publication Date Title
JP6855527B2 (en) Methods and devices for outputting information
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
CN110288985B (en) Voice data processing method and device, electronic equipment and storage medium
US20200286396A1 (en) Following teaching system having voice evaluation function
JP2019212288A (en) Method and device for outputting information
CN112399258B (en) Live playback video generation playing method and device, storage medium and electronic equipment
CN109378006B (en) Cross-device voiceprint recognition method and system
WO2021083071A1 (en) Method, device, and medium for speech conversion, file generation, broadcasting, and voice processing
JP2020034895A (en) Responding method and device
CN102568478A (en) Video play control method and system based on voice recognition
WO2014161282A1 (en) Method and device for adjusting playback progress of video file
WO2019007249A1 (en) Interaction method, electronic device, and server
KR20070070217A (en) Data-processing device and method for informing a user about a category of a media content item
CN108012173A (en) A kind of content identification method, device, equipment and computer-readable storage medium
Goecke et al. The audio-video Australian English speech data corpus AVOZES
CN111462553A (en) Language learning method and system based on video dubbing and sound correction training
CN109346057A (en) A kind of speech processing system of intelligence toy for children
CN110211609A (en) A method of promoting speech recognition accuracy
CN113392273A (en) Video playing method and device, computer equipment and storage medium
CN111079423A (en) Method for generating dictation, reading and reporting audio, electronic equipment and storage medium
KR20060087144A (en) A multimedia player and the multimedia-data search way using the player
CN114022668B (en) Method, device, equipment and medium for aligning text with voice
CN113889117A (en) Voice cross-correlation system, method, device and equipment
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN113345407B (en) Style speech synthesis method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination