CN113808593A - Voice interaction system, related method, device and equipment - Google Patents

Voice interaction system, related method, device and equipment

Info

Publication number
CN113808593A
Authority
CN
China
Prior art keywords
voice
program
language knowledge
determining
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010552193.7A
Other languages
Chinese (zh)
Inventor
郑梓豪 (Zheng Zihao)
胡于响 (Hu Yuxiang)
姜飞俊 (Jiang Feijun)
张帆 (Zhang Fan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010552193.7A
Publication of CN113808593A
Legal status: Pending

Classifications

    • G10L 15/30: Speech recognition; distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 15/08: Speech recognition; speech classification or search
    • G10L 15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • H04N 21/2393: Selective content distribution; interfacing the upstream path of the transmission network, involving handling client requests
    • H04N 21/4532: Selective content distribution; management of client data or end-user data involving end-user characteristics, e.g. viewer profile, preferences
    • H04N 21/482: Selective content distribution; end-user interface for program selection

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice interaction system and a related method, apparatus and device. The system collects voice data of a target user through a smart speaker and sends the voice data to a server. The server constructs, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain; determines a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user; and executes voice interaction processing according to the text sequence. With this processing mode, speech recognition can be performed against a knowledge base that is updated in real time, so that both high real-time performance and high accuracy of speech recognition can be achieved.

Description

Voice interaction system, related method, device and equipment
Technical Field
The present application relates to the technical field of data processing, and in particular to a voice interaction system, voice interaction methods and devices, television program playing methods and devices, conference recording methods and devices, speech recognition model construction methods and devices, a smart speaker, a smart TV and electronic equipment.
Background
The smart speaker is the product of upgrading the ordinary loudspeaker: a tool with which household consumers access the internet by voice, for example to order songs, shop online or check the weather forecast, and which can also control smart home devices, for example opening curtains, setting the refrigerator temperature, or pre-heating the water heater.
Users interact with the smart speaker mainly by voice. A user issues a voice instruction to the smart speaker, and the smart speaker recognizes the instruction through Automatic Speech Recognition (ASR) technology and executes it. A speaker-based interactive system involves a huge number of entity words. On the one hand, there are too many long-tailed entities (for example, obscure book titles) for a language model to memorize in full; on the other hand, many entities run counter to the language model (for example, creatively named song titles such as "Lover's Knot", or homophone names in a user's address book such as "梓豪" vs. "子豪", both pronounced "Zihao"). Such entities are challenging for ASR because: a) the training samples of the language model are limited, and sufficient coverage is hard to guarantee; b) creative entities are often deliberately "anti-language-model" in order to appear novel; c) the decoding space of the language model is relatively open and wide. To address this, a typical speech recognition system currently applied to smart speakers builds a dedicated sub-language model for each speaker skill; the sub-language model covers the long-tail entities, anti-language-model entities, homophone entities and so on related to that skill. Voice data for each speaker skill is then recognized through a general language model together with the sub-language model dedicated to that skill.
However, in the process of implementing the invention, the inventors found that this technical scheme has at least the following problems: 1) when new language knowledge appears for a speaker skill, the corresponding sub-language model must be retrained, which takes considerable time, so high real-time performance and high accuracy of speech recognition cannot both be achieved; 2) the sub-language model depends on sentence patterns; for example, the calling skill includes the sentence pattern "call Zheng Zihao", so the personalized language knowledge of one speaker skill (such as the name "Zihao") cannot be applied to the speech recognition of other speaker skills, which hurts recognition accuracy; 3) the speech recognition model in the prior art is a non-end-to-end model, which suffers from error propagation between modules, affecting recognition accuracy; 4) for a conversational interactive system such as the smart speaker, contextual information is important. For example, in the recently popular "you think, I guess" skill, an animal is being guessed; to the question "does it have horns?" the user replies "yǒu jiǎo" ("it has horns"). If the ASR cannot take the preceding context into account, the user's reply may be misrecognized as the homophone "it has feet", and the game can hardly proceed smoothly; the traditional speech recognition model cannot update the sub-language model in real time according to the conversation content, which affects recognition accuracy. In summary, how to provide a universal speech recognition framework that improves recognition accuracy is a problem urgently needing to be solved by those skilled in the art.
Disclosure of Invention
The present application provides a voice interaction system to address the low speech recognition accuracy of the prior art. The application further provides voice interaction methods and devices, television program playing methods and devices, conference recording methods and devices, speech recognition model construction methods and devices, a smart speaker, a smart TV and electronic equipment.
The application provides a voice interaction system, comprising:
a smart speaker, used for collecting voice data of a target user and sending the voice data to a server;
and the server, used for constructing, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain; determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user; and executing voice interaction processing according to the text sequence.
The application also provides a voice interaction method, comprising the following steps:
constructing, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain;
for voice data of a target user sent by a smart speaker, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user;
and executing voice interaction processing according to the text sequence.
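For illustration only, the following Python sketch shows one possible shape of this server-side flow. All names (KnowledgeBase, run_e2e_asr, execute_interaction) are assumptions of the sketch, not part of the application.

```python
# A minimal sketch of the server-side voice interaction flow described above.
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Personalized language knowledge base: words from one or more speaker service domains."""
    entries: set = field(default_factory=set)

knowledge_bases = {}          # one knowledge base per user

def run_e2e_asr(voice_data, knowledge_entries):
    """Stand-in for the end-to-end speech recognition model conditioned on the knowledge base."""
    return "<decoded text sequence>"

def execute_interaction(text_sequence):
    """Stand-in for downstream voice interaction processing (play a song, place a call, ...)."""
    print("executing:", text_sequence)

def handle_voice_data(user_id, voice_data):
    kb = knowledge_bases.setdefault(user_id, KnowledgeBase())
    text_sequence = run_e2e_asr(voice_data, kb.entries)
    execute_interaction(text_sequence)
    return text_sequence
```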
Optionally, the determining, through an end-to-end speech recognition model and the language knowledge base of the target user, of a text sequence corresponding to the voice data includes:
determining a first text feature corresponding to the voice data through a language model included in the speech recognition model;
determining a second text feature corresponding to the voice data according to the language knowledge base and the first text feature;
and determining the text sequence based at least on the second text feature.
Optionally, the determining, according to the language knowledge base and the first text feature, of a second text feature corresponding to the voice data includes:
determining the second text feature according to the language knowledge base and the first text feature through a pointer scoring model included in the speech recognition model.
Optionally: determining, through the pointer scoring model, a degree of correlation between the word corresponding to the first text feature and each piece of language knowledge;
and determining the second text feature based at least on the language knowledge whose correlation with the word is greater than a correlation threshold.
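For illustration only, the following sketch shows one way to read the scoring-and-threshold step above. Cosine similarity and mean fusion are assumptions of the sketch; the application only specifies a degree of correlation and a correlation threshold.

```python
import numpy as np

def pointer_scores(first_text_feature, knowledge_encodings):
    """Degree of correlation between the current word hypothesis (first text feature)
    and each encoded piece of language knowledge. Cosine similarity is an assumption."""
    scores = {}
    for word, enc in knowledge_encodings.items():
        denom = np.linalg.norm(first_text_feature) * np.linalg.norm(enc) + 1e-8
        scores[word] = float(first_text_feature @ enc) / denom
    return scores

def second_text_feature(first_text_feature, knowledge_encodings, threshold=0.5):
    scores = pointer_scores(first_text_feature, knowledge_encodings)
    relevant = [knowledge_encodings[w] for w, s in scores.items() if s > threshold]
    if not relevant:
        return first_text_feature  # no knowledge fires; keep the language model's feature
    # Fuse the relevant knowledge encodings into the first text feature (mean fusion is an assumption).
    return (first_text_feature + np.mean(relevant, axis=0)) / 2.0
```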
Optionally, the method further includes:
encoding the language knowledge through a language knowledge encoder included in the speech recognition model;
and storing the encoded language knowledge in a language knowledge storage module included in the speech recognition model;
and the determining of a second text feature corresponding to the voice data according to the language knowledge base and the first text feature includes:
determining the second text feature according to the first text feature and the encoded data stored in the language knowledge storage module.
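For illustration only, the following sketch encodes knowledge entries once and stores them in a memory module. The character-embedding-average encoder is an assumed stand-in for the language knowledge encoder of the speech recognition model.

```python
import numpy as np

DIM = 64
rng = np.random.default_rng(0)
char_table = {}  # character-level embedding table (an assumption of this sketch)

def embed_char(ch):
    if ch not in char_table:
        char_table[ch] = rng.normal(size=DIM)
    return char_table[ch]

def encode_knowledge(word):
    """Language knowledge encoder: here, a mean of character embeddings."""
    return np.mean([embed_char(c) for c in word], axis=0)

class KnowledgeMemory:
    """Language knowledge storage module: holds encoded entries for lookup at decode time."""
    def __init__(self):
        self.store = {}
    def put(self, word):
        self.store[word] = encode_knowledge(word)

memory = KnowledgeMemory()
memory.put("rabbit lamp")   # example entry from the description
```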
Optionally, the method further includes:
learning the speech recognition model from a training data set; each piece of training data includes: voice data, a personalized language knowledge base and text sequence annotation information.
Optionally, the personalized language knowledge base in the training data is determined as follows:
constructing the personalized language knowledge base from the text sequence annotation information of a plurality of pieces of training data.
Optionally, the personalized language knowledge base includes: long-tail entity words, entity words that run counter to the language model, homophone entity words, and entity words from the context.
Optionally, the speaker service domain includes the calling service domain;
the language knowledge of the calling service domain includes: names in the user's address book;
and the constructing, for each user, of a personalized language knowledge base containing language knowledge from at least one speaker service domain includes the following steps:
receiving the user's address book information sent by the smart speaker corresponding to the user;
and taking the names in the user's address book as personalized language knowledge of the user.
Optionally, the speaker service domain includes the question-and-answer service domain;
the language knowledge of the question-and-answer service domain includes: text segments from the context;
and the constructing of the personalized language knowledge base includes the following steps:
determining a context text sequence;
and taking text segments in the context text sequence as personalized language knowledge of the user.
Optionally, the speaker service domain includes the multimedia playing service domain;
the language knowledge of the multimedia playing service domain includes: song names;
and the constructing of the personalized language knowledge base includes the following steps:
determining the names of programs historically played by the user;
and taking the historically played program names as personalized language knowledge of the user.
Optionally, the constructing of the personalized language knowledge base is performed in at least one of the following ways:
determining personalized language knowledge of the user according to the user's shopping data;
and determining personalized language knowledge of the user according to text information entered by the user.
Optionally, the method further includes:
updating the user's language knowledge base according to the interactive voice data, as in the sketch below.
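For illustration only, the following sketch gathers the knowledge sources named above (address book, context, play history, shopping data, typed text, interactive voice) into one per-user knowledge base. All names are assumptions of the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class UserKnowledge:
    entries: set = field(default_factory=set)

    def from_contacts(self, names):          # calling service domain
        self.entries.update(names)

    def from_context(self, segments):        # question-and-answer service domain
        self.entries.update(segments)

    def from_play_history(self, programs):   # multimedia playing service domain
        self.entries.update(programs)

    def from_shopping(self, order_terms):    # product / brand / address names from orders
        self.entries.update(order_terms)

    def from_typed_text(self, words):        # text entered on a screen-equipped speaker
        self.entries.update(words)

    def update_from_dialogue(self, words):   # real-time update from interactive voice data
        self.entries.update(words)

kb = UserKnowledge()
kb.from_contacts(["Zihao"])
kb.from_play_history(["Lover's Knot"])
kb.update_from_dialogue(["horns"])           # e.g. a word from the dialogue context
```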
The application also provides a voice interaction method, comprising the following steps:
collecting voice data of a target user;
and sending the voice data to a server, so that the server constructs, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain; determines a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user; and executes voice interaction processing according to the text sequence.
The application also provides a voice interaction method, comprising the following steps:
determining a personalized language knowledge base of a user, containing language knowledge from at least one speaker service domain;
for collected voice data of the user, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base;
and executing voice interaction processing according to the text sequence.
The application also provides a television program playing method, comprising the following steps:
determining a personalized program playing language knowledge base of a user;
for collected program playing voice instruction data of the user, determining the target program name corresponding to the voice instruction data through an end-to-end speech recognition model together with the language knowledge base;
and playing the target program object according to the target program name.
Optionally, the method further includes:
taking the program names, actor names and/or director names of program objects historically played by the user as personalized program playing language knowledge of the user.
Optionally, the playing of the target program object according to the target program name includes:
determining the television channel and playing time corresponding to the target program name according to a program list;
determining the target program object according to the playing time and the television channel;
and playing the target program object.
Optionally, the determining of the target program object according to the playing time and the television channel includes:
displaying a plurality of program objects played at a plurality of times by at least one television channel corresponding to the target program name;
and taking the program object specified by the user as the target program object.
Optionally, the method further includes:
if the program list does not include the target program name, determining program names related to the target program name;
displaying the related program names;
and if the user specifies a related program object to be played, playing that related program object.
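For illustration only, the following sketch resolves a recognized target program name against a program list as described above; the substring match used for related program names is an assumption of the sketch.

```python
from dataclasses import dataclass

@dataclass
class Listing:
    name: str
    channel: str
    time: str

def find_program(target_name, program_list):
    """Resolve a recognized target program name against the program list."""
    hits = [l for l in program_list if l.name == target_name]
    if len(hits) == 1:
        return hits[0]      # unique channel/time: play directly
    if hits:
        return hits         # several times/channels: let the user pick one
    # Target name not in the list: offer related program names instead
    # (substring matching is an assumption of this sketch).
    related = [l for l in program_list if target_name in l.name]
    return related or None

epg = [Listing("Evening News", "CCTV-1", "19:00"),
       Listing("Evening News Review", "CCTV-2", "21:00")]
print(find_program("Evening News", epg))
```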
The application also provides a television program playing method, comprising the following steps:
determining the target program name corresponding to user program playing voice instruction data collected by a smart TV;
determining the target program object corresponding to the target program name according to a program list;
and playing the target program object.
Optionally, the determining of the target program object corresponding to the target program name according to a program list includes:
determining the historical target program object corresponding to the target program name according to a historical program list;
and determining the current target program object corresponding to the target program name according to the current program list.
The application also provides a television program playing method, comprising the following steps:
a server, for program playing voice instruction data of a user collected by a smart TV, determines the target program name corresponding to the voice instruction data;
determines the target program object corresponding to the target program name according to a program list;
and plays the target program object through the smart TV.
The application also provides a television program playing method, comprising the following steps:
a smart TV collects program playing voice instruction data of a user;
sends the voice instruction data to a server, so that the server determines the target program name corresponding to the voice instruction data and determines the target program object corresponding to the target program name according to a program list;
and the target program object is played.
The application also provides a conference recording method, comprising the following steps:
constructing a language knowledge base of the conference field;
collecting conference voice data;
and determining a text sequence corresponding to the conference voice data through an end-to-end speech recognition model together with the language knowledge base of the conference field, to form a conference record.
Optionally, the method further includes:
determining the conference field corresponding to the conference voice data.
The application also provides a conference recording method, comprising the following steps:
constructing a language knowledge base for each field;
for voice data of a target conference sent by a terminal device, determining the field to which the target conference belongs;
and determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target conference's field, to form a conference record of the target conference.
The application also provides a conference recording method, comprising the following steps:
collecting voice data of a target conference;
and sending the voice data to a server, so that the server determines the field to which the target conference belongs, and determines a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of that field, to form a conference record of the target conference.
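For illustration only, the following sketch routes conference audio to the knowledge base of its field before transcription. detect_domain and run_e2e_asr are assumed interfaces, injected as parameters so the sketch stays self-contained.

```python
def transcribe_meeting(voice_data, domain_kbs, detect_domain, run_e2e_asr):
    """Form a conference record: pick the knowledge base of the conference's field,
    then decode with the end-to-end speech recognition model."""
    domain = detect_domain(voice_data)   # e.g. "medicine", "software"
    kb = domain_kbs.get(domain, set())
    return run_e2e_asr(voice_data, kb)   # text sequence forming the meeting record
```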
The application also provides a speech recognition model construction method, comprising the following steps:
determining a training data set; each piece of training data includes: voice data, a personalized language knowledge base and text sequence annotation information;
constructing the network structure of an end-to-end speech recognition model;
and learning the speech recognition model from the training data set.
Optionally, the model includes: a speech encoder, a decoder, a language knowledge encoder and a classifier.
Optionally, the model includes: a speech encoder, a decoder, a language knowledge encoder, a language model, a feature fusion module and a classifier.
Optionally, the model includes: a speech encoder, a language model, a language knowledge encoder, a pointer scoring model, a feature fusion module and a classifier.
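For illustration only, the following PyTorch sketch instantiates the third variant above (speech encoder, language model, language knowledge encoder, pointer scoring model, feature fusion module, classifier). Layer types and sizes are assumptions; the application does not fix the network details.

```python
import torch
import torch.nn as nn

class E2EASRWithKnowledge(nn.Module):
    def __init__(self, n_feats=80, n_tokens=5000, d=256):
        super().__init__()
        self.speech_encoder = nn.LSTM(n_feats, d, batch_first=True)  # acoustic encoder
        self.language_model = nn.LSTM(d, d, batch_first=True)        # yields first text features
        self.knowledge_encoder = nn.EmbeddingBag(n_tokens, d)        # encodes knowledge entries
        self.pointer_score = nn.Bilinear(d, d, 1)                    # relevance scoring
        self.fusion = nn.Linear(2 * d, d)                            # feature fusion module
        self.classifier = nn.Linear(d, n_tokens)                     # token classifier

    def forward(self, feats, knowledge_ids, knowledge_offsets):
        h, _ = self.speech_encoder(feats)             # (B, T, d)
        first, _ = self.language_model(h)             # first text features
        kmem = self.knowledge_encoder(knowledge_ids, knowledge_offsets)  # (K, d)
        B, T, d = first.shape
        # Score every (frame, knowledge entry) pair, then attend over the knowledge memory.
        q = first.reshape(B * T, d).unsqueeze(1).expand(-1, kmem.size(0), -1)
        k = kmem.unsqueeze(0).expand(B * T, -1, -1)
        scores = self.pointer_score(q.reshape(-1, d), k.reshape(-1, d)).view(B * T, -1)
        attn = torch.softmax(scores, dim=-1) @ kmem   # second text features
        fused = self.fusion(torch.cat([first.reshape(B * T, d), attn], dim=-1))
        return self.classifier(fused).view(B, T, -1)  # logits over output tokens
```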
The application also provides a speech recognition method, comprising the following steps:
constructing a personalized language knowledge base of a user, containing language knowledge from at least one field;
and for user voice data collected by a terminal device, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base.
The present application further provides a voice interaction apparatus, comprising:
a knowledge base construction unit, used for constructing, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain;
a speech recognition unit, used for determining, for voice data of a target user sent by a smart speaker, a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user;
and an instruction processing unit, used for executing voice interaction processing according to the text sequence.
The present application further provides an electronic device, comprising:
a processor and a memory;
the memory stores a program implementing the voice interaction method; after the device is powered on and the program is run by the processor, the following steps are executed: constructing, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain; for voice data of a target user sent by a smart speaker, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user; and executing voice interaction processing according to the text sequence.
The present application further provides a voice interaction apparatus, comprising:
a voice data collection unit, used for collecting voice data of a target user;
and a voice data sending unit, used for sending the voice data to a server, so that the server constructs, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain; determines a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user; and executes voice interaction processing according to the text sequence.
The present application further provides a smart speaker, comprising:
a processor and a memory;
the memory stores a program implementing the voice interaction method; after the device is powered on and the program is run by the processor, the following steps are executed: collecting voice data of a target user; sending the voice data to a server, so that the server constructs, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain; determines a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user; and executes voice interaction processing according to the text sequence.
The present application further provides an electronic device, comprising:
a processor and a memory;
the memory stores a program implementing the voice interaction method; after the device is powered on and the program is run by the processor, the following steps are executed: collecting voice data of a target user; sending the voice data to a server, so that the server constructs, for each user, a personalized language knowledge base containing language knowledge from at least one field; determines a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user; and executes voice interaction processing according to the text sequence.
The present application further provides a speech recognition model building apparatus, comprising:
a data preparation unit, used for determining a training data set; each piece of training data includes: voice data, a personalized language knowledge base and text sequence annotation information;
a network construction unit, used for constructing the network structure of an end-to-end speech recognition model;
and a network training unit, used for learning the speech recognition model from the training data set.
The present application further provides an electronic device, comprising:
a processor and a memory;
the memory stores a program implementing the speech recognition model construction method; after the device is powered on and the program is run by the processor, the following steps are executed: determining a training data set, where each piece of training data includes voice data, a personalized language knowledge base and text sequence annotation information; constructing the network structure of an end-to-end speech recognition model; and learning the speech recognition model from the training data set.
The present application further provides a voice interaction apparatus, comprising:
a knowledge base construction unit, used for determining a personalized language knowledge base of a user, containing language knowledge from at least one field;
a speech recognition unit, used for determining, for collected voice data of the user, a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base;
and an instruction processing unit, used for executing voice interaction processing according to the text sequence.
The present application further provides a smart speaker, comprising:
a processor and a memory;
the memory stores a program implementing the voice interaction method; after the device is powered on and the program is run by the processor, the following steps are executed: determining a personalized language knowledge base of a user, containing language knowledge from at least one speaker service domain; for collected voice data of the user, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base; and executing voice interaction processing according to the text sequence.
The present application further provides an electronic device, comprising:
a processor and a memory;
the memory stores a program implementing the voice interaction method; after the device is powered on and the program is run by the processor, the following steps are executed: determining a personalized language knowledge base of the user, containing language knowledge from at least one field; for collected voice data of the user, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base; and executing voice interaction processing according to the text sequence.
The present application further provides a television program playing device, comprising:
a knowledge base determining unit, used by the smart TV for determining a personalized program playing language knowledge base of a user;
a program identification unit, used for determining, for collected program playing voice instruction data of the user, the target program name corresponding to the voice instruction data through an end-to-end speech recognition model together with the language knowledge base;
and a program playing unit, used for playing the target program object according to the target program name.
The application also provides a smart TV, comprising:
a processor and a memory;
the memory stores a program implementing the television program playing method; after the device is powered on and the program is run by the processor, the following steps are executed: determining a personalized program playing language knowledge base of a user; for collected program playing voice instruction data of the user, determining the target program name corresponding to the voice instruction data through an end-to-end speech recognition model together with the language knowledge base; and playing the target program object according to the target program name.
The present application further provides a television program playing device, comprising:
a program name recognition unit, used for determining the target program name corresponding to user program playing voice instruction data collected by the smart TV;
a program object determining unit, used for determining the target program object corresponding to the target program name according to a program list;
and a playing unit, used for playing the target program object.
The application also provides a smart TV, comprising:
a processor and a memory;
the memory stores a program implementing the television program playing method; after the device is powered on and the program is run by the processor, the following steps are executed: determining the target program name corresponding to user program playing voice instruction data collected by the smart TV; determining the target program object corresponding to the target program name according to a program list; and playing the target program object.
The present application further provides a television program playing device, comprising:
a program name recognition unit, used for determining, for program playing voice instruction data of a user collected by the smart TV, the target program name corresponding to the voice instruction data;
a program object determining unit, used for determining the target program object corresponding to the target program name according to a program list;
and a playing unit, used for playing the target program object through the smart TV.
The present application further provides an electronic device, comprising:
a processor and a memory;
the memory stores a program implementing the television program playing method; after the device is powered on and the program is run by the processor, the following steps are executed: determining, for program playing voice instruction data of a user collected by the smart TV, the target program name corresponding to the voice instruction data; determining the target program object corresponding to the target program name according to a program list; and playing the target program object through the smart TV.
The present application further provides a television program playing device, comprising:
a voice instruction collection unit, used by the smart TV for collecting program playing voice instruction data of a user;
a voice instruction sending unit, used for sending the voice instruction data to a server, so that the server determines the target program name corresponding to the voice instruction data and determines the target program object corresponding to the target program name according to a program list;
and a playing unit, used for playing the target program object.
The application also provides a smart TV, comprising:
a processor and a memory;
the memory stores a program implementing the television program playing method; after the device is powered on and the program is run by the processor, the following steps are executed: collecting program playing voice instruction data of a user; sending the voice instruction data to a server, so that the server determines the target program name corresponding to the voice instruction data and determines the target program object corresponding to the target program name according to a program list; and playing the target program object.
The present application further provides a conference recording apparatus, comprising:
a knowledge base construction unit, used for constructing a language knowledge base of the conference field;
a voice data collection unit, used for collecting conference voice data;
and a voice transcription unit, used for determining a text sequence corresponding to the conference voice data through an end-to-end speech recognition model together with the language knowledge base of the conference field, to form a conference record.
The present application further provides an electronic device, comprising:
a processor and a memory;
the memory stores a program implementing the conference recording method; after the device is powered on and the program is run by the processor, the following steps are executed: constructing a language knowledge base of the conference field; collecting conference voice data; and determining a text sequence corresponding to the conference voice data through an end-to-end speech recognition model together with the language knowledge base of the conference field, to form a conference record.
The present application further provides a conference recording apparatus, comprising:
a knowledge base construction unit, used for constructing language knowledge bases for all fields;
a conference field determining unit, used for determining, for voice data of a target conference sent by a terminal device, the field to which the target conference belongs;
and a voice transcription unit, used for determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target conference's field, to form a conference record of the target conference.
The present application further provides an electronic device, comprising:
a processor and a memory;
the memory stores a program implementing the conference recording method; after the device is powered on and the program is run by the processor, the following steps are executed: constructing a language knowledge base for each field; for voice data of a target conference sent by a terminal device, determining the field to which the target conference belongs; and determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target conference's field, to form a conference record of the target conference.
The present application further provides a conference recording apparatus, comprising:
a voice data collection unit, used for collecting voice data of a target conference;
and a voice data sending unit, used for sending the voice data to a server, so that the server determines the field to which the target conference belongs and determines a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of that field, to form a conference record of the target conference.
The present application further provides an electronic device, comprising:
a processor and a memory;
the memory stores a program implementing the conference recording method; after the device is powered on and the program is run by the processor, the following steps are executed: collecting voice data of a target conference; sending the voice data to a server, so that the server determines the field to which the target conference belongs; and determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of that field, to form a conference record of the target conference.
The present application further provides a speech recognition apparatus, comprising:
a knowledge base construction unit, used for constructing a personalized language knowledge base of a user, containing language knowledge from at least one field;
and a model prediction unit, used for determining, for user voice data collected by a terminal device, a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base.
The present application further provides an electronic device, comprising:
a processor and a memory;
the memory stores a program implementing the speech recognition method; after the device is powered on and the program is run by the processor, the following steps are executed: constructing a personalized language knowledge base of a user, containing language knowledge from at least one field; and for user voice data collected by a terminal device, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base.
The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the present application has the following advantages:
According to the voice interaction system provided by the embodiments of the application, voice data of a target user is collected through a smart speaker and sent to the server; the server constructs, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain; determines a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user; and executes voice interaction processing according to the text sequence. This processing mode achieves at least the following technical effects:
1) Regarding the problem in the traditional technology that the sub-language model must be retrained whenever language knowledge is updated, which takes considerable time and hurts the real-time performance of speech recognition: the system provided by the application performs recognition against a language knowledge base that can be updated in real time, without retraining any model, so both high real-time performance and high accuracy of speech recognition can be achieved.
2) Regarding the sentence pattern dependence problem of the traditional technology (for example, the sentence pattern "call Zheng Zihao"): the system provided by the application is built on a language knowledge base composed of vocabulary and does not depend on sentence patterns, and vocabulary from one speaker skill can be applied to the speech recognition of other speaker skills, so recognition accuracy can be effectively improved.
3) The model the system relies on is an end-to-end model, avoiding the error propagation problem of traditional non-end-to-end models, so recognition accuracy can be effectively improved.
4) Regarding the problem in the traditional technology that a sub-language model is effective only for its own speaker skill: the system provided by the application is built on a language knowledge base composed of words; the knowledge base can contain personalized language knowledge from several speaker skills, and one speaker skill can draw on the personalized language knowledge of other speaker skills, so recognition accuracy can be effectively improved.
The speech recognition model construction method provided by the embodiments of the application determines a training data set, where each piece of training data includes voice data, a personalized language knowledge base and text sequence annotation information; constructs the network structure of an end-to-end speech recognition model; and learns the speech recognition model from the training data set. This processing mode allows the model to take in prior language knowledge personalized to the user, so model accuracy can be effectively improved. Because the model is independent of sentence patterns, the extensibility of speech recognition can be effectively improved. The model can also perform speech recognition against a language knowledge base updated in real time, improving the real-time performance of recognition, and it can serve multiple speaker skills, further improving extensibility. In addition, the end-to-end model has no error propagation problem, so recognition accuracy can be effectively improved.
According to the television program playing method provided by the embodiments of the application, a personalized program playing language knowledge base of a user is determined through the smart TV; for collected program playing voice instruction data of the user, the target program name corresponding to the voice instruction data is determined through an end-to-end speech recognition model together with the language knowledge base; and the target program object is played according to the target program name. This processing mode allows the program-on-demand knowledge base to be updated in real time and recognition to be performed against the up-to-date knowledge base, so both high real-time performance and high accuracy of speech recognition can be achieved. Moreover, since the model relied on is end-to-end, the error propagation problem of non-end-to-end models is avoided and recognition accuracy is effectively improved.
According to the television program playing method provided by the embodiments of the application, the target program name corresponding to program playing voice instruction data is determined through the smart TV; the target program object corresponding to the target program name is determined according to a program list; and the target program object is played. This processing mode automatically plays the corresponding program object according to the user's voice instruction, saving the user from switching channels one by one and searching for a program of interest manually; user experience is therefore effectively improved, and computing resources of the program server are saved.
The conference recording method provided by the embodiments of the application constructs a language knowledge base of the conference field; collects conference voice data; and determines a text sequence corresponding to the conference voice data through an end-to-end speech recognition model together with the language knowledge base of the conference field, to form a conference record. This processing mode performs conference speech recognition against the language knowledge base of the conference field and avoids the traditional problem that each conference field's sub-language model must be retrained when language knowledge is updated, which takes considerable time and hurts the real-time performance of recognition; both high real-time performance and high accuracy of recognition can therefore be achieved. Moreover, since the model relied on is end-to-end, the error propagation problem of non-end-to-end models is avoided and recognition accuracy is effectively improved.
Drawings
FIG. 1 is a schematic structural diagram of an embodiment of a voice interaction system provided by the present application;
FIG. 2 is a schematic diagram of a scenario of an embodiment of a voice interaction system provided by the present application;
FIG. 3 is a schematic diagram of device interaction of an embodiment of a voice interaction system provided by the present application;
FIG. 4 is a schematic diagram of an end-to-end speech recognition model of an embodiment of the voice interaction system provided by the present application;
FIG. 5 is a schematic diagram of another end-to-end speech recognition model of an embodiment of the voice interaction system provided by the present application;
FIG. 6 is a model diagram of an embodiment of the voice interaction system provided by the present application;
FIG. 7 is a diagram illustrating the pointer scoring model of an embodiment of the voice interaction system provided by the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The application provides a voice interaction system, voice interaction methods and devices, television program playing methods and devices, conference recording methods and devices, speech recognition model construction methods and devices, a smart speaker, a smart TV and electronic equipment. Each scheme is described in detail in the following embodiments.
First embodiment
Please refer to FIG. 1, which shows a voice interaction system according to an embodiment of the present application. The voice interaction system provided by this embodiment comprises: a server 1 and a smart speaker 2.
The server 1 may be a server deployed on a cloud server, or a server dedicated to implementing the voice interaction system, which may be deployed in a data center.
The smart speaker 2 is a tool with which a household consumer accesses the internet by voice, for example to order songs, shop online or check the weather forecast; it can also control smart home devices, for example opening curtains, setting the refrigerator temperature, or pre-heating the water heater.
Please refer to FIG. 2, a scenario diagram of the voice interaction system of the present application. The server 1 and the smart speaker 2 are connected via a network; for example, the smart speaker 2 may go online via WIFI and the like. The user interacts with the smart speaker by voice: the user issues a voice instruction to the smart speaker 2 (such as turning on a rabbit lamp, or calling a certain person), and the smart speaker sends the user's voice data to the server; the server determines the text sequence of the voice data through speech recognition technology, and executes voice interaction processing according to the decoded text sequence.
Please refer to FIG. 3, a device interaction diagram of the voice interaction system of the present application. In this embodiment, the smart speaker is configured to collect voice data of a target user and send the voice data to the server; the server is configured to construct, for each user, a personalized language knowledge base containing language knowledge from at least one speaker service domain; determine a text sequence corresponding to the voice data through an end-to-end speech recognition model together with the language knowledge base of the target user; and execute voice interaction processing according to the text sequence.
The system provided by the embodiments of the application can build a dedicated language knowledge base for each user, such as a knowledge base for user A and a knowledge base for user B. The personalized language knowledge base contains words from at least one speaker service domain of the user; these words are independent of sentence patterns, so there is no sentence-pattern dependence problem.
In a specific implementation, a dedicated language knowledge base can be built for each speaker service, such as a knowledge base for the "song ordering" skill, a knowledge base for the "you think, I guess" skill, and a knowledge base for the "phone call" skill. With this processing mode, different speaker skills can hold different language knowledge; for example, the song-ordering knowledge base of user A contains the song title "Jinian" (纪念), while the phone-call knowledge base of user A contains the contact name "Jinian". The speech recognition accuracy of each speaker skill can thus be improved, and the skills do not interfere with each other.
In a specific implementation, a language knowledge base shared by multiple speaker services can also be built, so that it contains words from several speaker service domains, such as an obscure book title, the name "梓豪" (Zihao), the song title "Lover's Knot", and so on. With this processing mode, the knowledge base can contain the personalized language knowledge of several speaker skills, and one speaker skill can draw on the personalized language knowledge of other speaker skills, so recognition accuracy, especially for new skills, can be effectively improved.
The vocabulary in the language knowledge base includes but is not limited to: long-tail entity words (for example, obscure book titles), entity words that run counter to the language model (for example, the song title "Lover's Knot"), homophone entity words (for example, "梓豪" vs. "子豪", both pronounced "Zihao"), and entity words appearing in the context of the voice interaction (for example, "horns").
In one example, the speaker service domain includes: the field of telephone service; the language knowledge of the calling service domain includes: names in the user address list, such as catalpa; the construction of the personalized language knowledge base including the language knowledge of at least one loudspeaker box service field of each user comprises the following steps: receiving user address book information sent by the intelligent sound box corresponding to the user; and taking the name in the user address list as the personalized language knowledge of the user. By adopting the processing mode, the language knowledge base comprises the name information of the address list, so that the call object can be more accurately determined; therefore, the conversation accuracy can be effectively improved.
In one example, the speaker service domain includes: the field of question and answer services, such as the skill of "think me guess" of the demon of the Temple; the language knowledge in the field of question-answering services comprises: a text segment in context; the construction of the personalized language knowledge base including the language knowledge of at least one loudspeaker box service field of each user comprises the following steps: determining a sequence of contextual text corresponding to the user's contextual phonetic data, such as between mao tianjingling to "how it has angles"; and taking the text segments in the context text sequence as personalized language knowledge of the user, such as the word 'angular'. By adopting the processing mode, the language knowledge base comprises the context entity information in the question and answer process, so that the answer text can be more accurately determined; therefore, the question answering accuracy can be effectively improved, and the question answering process can be smoothly carried out.
In one example, the speaker service domain includes: multimedia playing service fields, such as the "song order" skill of the tianmao elfin; the language knowledge in the multimedia playing service field comprises: the name of the song; the construction of the personalized language knowledge base including the language knowledge of at least one loudspeaker box service field of each user comprises the following steps: determining the historical playing program name of the user; and taking the historical playing program name as the personalized language knowledge of the user. By adopting the processing mode, the language knowledge base comprises song names, movie names and the like, so that programs which the user wants to play can be more accurately determined; therefore, the program playing accuracy can be effectively improved, and the user experience is improved.
In one example, the personalized language knowledge base including language knowledge of at least one speaker service field of each user is constructed in at least one of the following ways: determining the user's personalized language knowledge from the user's shopping data (for example, if user A queries the logistics status of an online order through the speaker, the product names, brand names, address names and the like related to the order are taken as the user's personalized language knowledge); and determining the user's personalized language knowledge from text entered by the user (for example, if user A types "rabbit lamp" on a speaker with a screen, that word is taken as the user's personalized language knowledge). With this processing mode, richer personalized language knowledge can be obtained for each user, so the accuracy of speech recognition is effectively improved.
In one example, the server may be further configured to update the user's language knowledge base according to the interaction voice data, such as adding "jinhaixin" spoken by the user, or "horns" spoken by the Tmall Genie, to the knowledge base. Because the knowledge base is updated in real time in this way, the accuracy of speech recognition can be effectively improved.
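The following sketch illustrates the kind of per-user knowledge base these examples describe. It is a minimal illustration only: the class name, method names and the character n-gram segmentation are assumptions made for the sketch, not an API defined by this application.

```python
# Minimal per-user language knowledge base; phrases are plain strings.
class PersonalizedKnowledgeBase:
    def __init__(self):
        self.phrases = set()

    def add_contacts(self, address_book):
        # Call domain: contact names sent by the smart speaker.
        self.phrases.update(address_book)

    def add_play_history(self, program_names):
        # Multimedia domain: names of historically played songs/movies.
        self.phrases.update(program_names)

    def add_context(self, context_text, max_len=4):
        # Q&A domain: short segments cut from the dialogue context.
        for n in range(1, max_len + 1):
            for i in range(len(context_text) - n + 1):
                self.phrases.add(context_text[i:i + n])

    def update_from_interaction(self, utterance_text):
        # Real-time update from the latest interaction round.
        self.phrases.update(utterance_text.split())

kb = PersonalizedKnowledgeBase()
kb.add_contacts(["catalpa hao"])
kb.add_play_history(["lover knot", "memorial"])
kb.update_from_interaction("rabbit lamp")
```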
An end-to-end (End2End) speech recognition model converts the speech signal into text using a recognition framework in which the acoustic model and the language model are fused into one model; there is therefore no error propagation between separate modules, and the accuracy of speech recognition can be improved.
In one example, the end-to-end speech recognition model may be a CLAS (Contextual Listen, Attend and Spell) model as shown in fig. 4, which may include a speech encoder, a decoder, a language knowledge encoder and a classifier. In this speech recognition model the language model is not explicitly visible. In order to introduce prior knowledge (the user's personalized language knowledge) and restrict the model's search to a suitable range, a corresponding bias must be introduced; in implementation, a special symbol (such as "$") can be used. In this embodiment, if the speech recognition model reads a phrase from the memory, the special symbol "$" follows the corresponding phrase in the decoding output, marking that the phrase was introduced from the memory rather than produced by normal decoding.
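One plausible reading of this "$" mechanism is that the special symbol is appended after bias phrases in the training targets, so the decoder learns to emit it whenever it copies a phrase from memory. The sketch below illustrates that reading only; the exact symbol placement and training recipe are assumptions, not confirmed by this application.

```python
# Hedged sketch: append "$" after every bias phrase found in a transcript.
def mark_bias_phrases(transcript, bias_phrases, symbol="$"):
    # Longer phrases first, so a phrase nested in another is not marked twice.
    for phrase in sorted(bias_phrases, key=len, reverse=True):
        transcript = transcript.replace(phrase, phrase + symbol)
    return transcript

print(mark_bias_phrases("play lover knot for me", ["lover knot"]))
# -> play lover knot$ for me
```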
Besides the model of fig. 4, the system provided by the embodiments of the present application may adopt end-to-end speech recognition models with other structures, such as the models of fig. 5 or fig. 6. In these embodiments, the server is specifically configured to: determine, through the language model included in the speech recognition model, a first text feature corresponding to the voice data; determine a second text feature corresponding to the voice data according to the language knowledge base and the first text feature; and determine the text sequence at least according to the second text feature.
The first text feature is the text feature corresponding to the voice data as determined by the language model alone. For example, the first text feature of the pronunciation "tu tu tu" includes the text feature of "native rabbit", and the first text feature of the pronunciation "ji nian" includes the text feature of "commemoration".
The second text feature is the text feature corresponding to the voice data as determined from the language knowledge base together with the first text feature. For example, if the language knowledge base includes the word "rabbit lamp", the second text feature of the pronunciation "tu tu tu" includes the text feature of "rabbit"; if the language knowledge base includes the word "memorial", the second text feature of the pronunciation "ji nian" includes the text feature of "memorial".
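As a toy illustration of this two-stage idea, the sketch below represents the features simply as candidate strings; the real model operates on vector features, and the overlap rule used here is only an assumption for the illustration.

```python
# Toy sketch: the LM proposes a hypothesis (first "feature"); the knowledge
# base then biases it toward an overlapping phrase (second "feature").
def second_hypothesis(first_hypothesis, knowledge_base):
    for phrase in knowledge_base:
        if any(ch in first_hypothesis for ch in phrase):
            return phrase
    return first_hypothesis

print(second_hypothesis("native rabbit", {"rabbit lamp"}))  # -> rabbit lamp
```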
In one example, the end-to-end speech recognition model may be the model shown in fig. 5, which includes a speech encoder, a decoder, a language knowledge encoder, a language model, a feature fusion module and a classifier. This speech recognition model contains an explicit language model. Through the feature fusion module, the model fuses the features output by the language model, the features output by the decoder and the features output by the language knowledge encoder to obtain the features of the text sequence corresponding to the user voice data; the classifier then determines the text sequence corresponding to the user voice data from the fused text features.
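A minimal sketch of a fusion module of this shape is given below, assuming all three feature streams have already been projected to a common dimension; the layer sizes and the concatenation strategy are illustrative assumptions rather than the exact architecture of fig. 5.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse language-model, decoder and knowledge-encoder features."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(3 * dim, dim)       # fuse by concatenation
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, lm_feat, dec_feat, know_feat):
        fused = torch.cat([lm_feat, dec_feat, know_feat], dim=-1)
        fused = torch.tanh(self.proj(fused))
        return self.classifier(fused)             # logits over the vocabulary

fusion = FeatureFusion(dim=256, vocab_size=5000)
x = torch.randn(8, 10, 256)                       # (batch, steps, dim)
logits = fusion(x, x, x)                          # (8, 10, 5000)
```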
As shown in fig. 5, the end-to-end speech recognition model includes a language model, which may be a general-purpose language model. On the one hand, this language model can be trained on a rich corpus whose training samples typically contain common expressions and need not contain the user's personalized language knowledge, such as long-tail entity words, anti-language-model entity words like "lover knot", or homophones like "catalpa hao". With this processing mode, speech recognition does not rely on the language model to introduce the user's personalized language knowledge, and that knowledge does not affect the language model; when the user gains new personalized language knowledge, the language model need not be retrained. The new knowledge is simply added to the user's personalized language knowledge base, the user's personalized language knowledge updated in real time is introduced into the speech recognition model through the language knowledge encoder, and the text is determined by the speech recognition model. High real-time performance and high recognition accuracy can therefore both be achieved. On the other hand, compared with the model shown in fig. 4, the model shown in fig. 5 can be trained on a rich corpus to obtain a more accurate language model, so the accuracy of speech recognition is effectively improved.
In another example, the end-to-end speech recognition model may be the model shown in fig. 6, which includes a speech encoder, a language model, a language knowledge encoder, a pointer scoring model, a feature fusion module and a classifier. This speech recognition model also contains an explicit language model. The server is specifically configured to determine the text sequence according to the first text feature and the second text feature. As shown in fig. 6, the server may encode the vocabulary in the language knowledge base through the language knowledge encoder and store the encoding results in an External Memory. For the voice data to be decoded, the acoustic features of the model's input data are encoded by the speech encoder included in the speech recognition model, and the encoding result can be stored in memory and used as the input data of the explicit language model. The language model then determines the first text feature from this encoding, such as the feature of "opening the soil rabbit". Next, the pointer scoring model determines the second text feature, such as the feature of "opening a rabbit", according to the language knowledge in the external memory and the first text feature. The feature fusion module included in the speech recognition model then fuses its two inputs, the first text feature and the second text feature; the module may concatenate the two features, add them, and so on. For example, the output of the feature fusion module includes the feature of "opening rabbit". Finally, the text sequence is determined from the fused text features. With this processing mode, the text sequence is determined according to both the first text feature and the second text feature, so the accuracy of speech recognition is effectively improved.
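A compact sketch of a fig. 6 style forward pass is given below under several simplifying assumptions (a single feature dimension, dot-product pointer scoring, concatenation fusion, GRU encoders). It illustrates the data flow only; the class name, layer choices and sizes are assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class PointerBiasedASR(nn.Module):
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.speech_enc = nn.GRU(80, dim, batch_first=True)  # fbank -> features
        self.lm = nn.GRU(dim, dim, batch_first=True)         # explicit LM
        self.know_enc = nn.GRU(dim, dim, batch_first=True)   # phrase encoder
        self.fuse = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, fbank, phrase_embs):
        # 1) Encode acoustics; the result feeds the explicit language model.
        enc, _ = self.speech_enc(fbank)                      # (B, T, dim)
        first, _ = self.lm(enc)                              # first text feature
        # 2) Encode knowledge phrases into the external memory.
        memory, _ = self.know_enc(phrase_embs)               # (B, M, dim)
        # 3) Pointer scoring: relevance of each step to each phrase.
        scores = torch.softmax(first @ memory.transpose(1, 2), dim=-1)
        second = scores @ memory                             # second text feature
        # 4) Fuse the two features and classify.
        fused = torch.tanh(self.fuse(torch.cat([first, second], dim=-1)))
        return self.classifier(fused)

model = PointerBiasedASR(dim=128, vocab_size=4000)
logits = model(torch.randn(2, 50, 80), torch.randn(2, 6, 128))  # (2, 50, 4000)
```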
In this embodiment, the server determines, through the pointer scoring model, the second text feature corresponding to the voice data according to the language knowledge base and the first text feature. This may be implemented as follows: determine the relevance between the words corresponding to the first text feature and each item of language knowledge; then determine the second text feature at least according to the language knowledge whose relevance to those words is greater than a relevance threshold. With this processing mode, no special symbol needs to be added when the prior knowledge is introduced, so the training data of the language model is unchanged and better decoding is achieved in a lightweight way; the accuracy of speech recognition is thus effectively improved.
In specific implementation, for each word in the text corresponding to the first text feature, the relevance between the text segment ending at the current word and each item of language knowledge is determined from that segment's first text feature (such as the feature of "opening the soil rabbit"); for example, the relevance between "soil rabbit" and "rabbit lamp". If the relevance is greater than the relevance threshold, the second text feature of the text segment ending at the current word, such as the feature of "opening a rabbit", is determined according to the related language knowledge (such as "rabbit lamp").
As shown in fig. 7, at each step of the decoding process, the M candidate vocabulary entries (personalized language knowledge words) in the External Memory may be binary-classified as related or unrelated. As can be seen from fig. 7, the scoring model computes a B x T x M relevance tensor, where T indexes the encodings of the words corresponding to the first text feature and M indexes the encodings of the words in the knowledge base. For example, if the text decoded at the first step is "open", the relevance between that word and each word in the knowledge base is determined; no word exceeds the relevance threshold, so the corresponding second text feature is simply the feature of "open". If the text decoded by the fourth step is "open the soil rabbit", the relevance between each of its words and each knowledge-base word is determined; here the relevance between "soil rabbit" and "rabbit lamp" exceeds the threshold, so the corresponding second text feature includes the feature of "open rabbit".
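The scoring just described can be sketched in a few lines. The cosine relevance and the fixed threshold below are assumptions made for the illustration, since the application does not fix the scoring function.

```python
import numpy as np

def pointer_scores(first_feats, phrase_encs, threshold=0.5):
    """first_feats: (B, T, D) step encodings; phrase_encs: (B, M, D)."""
    # Cosine relevance of every decoding step to every knowledge phrase.
    a = first_feats / np.linalg.norm(first_feats, axis=-1, keepdims=True)
    b = phrase_encs / np.linalg.norm(phrase_encs, axis=-1, keepdims=True)
    rel = np.einsum("btd,bmd->btm", a, b)        # B x T x M relevance tensor
    related = rel > threshold                    # binary classification
    return rel, related

rel, related = pointer_scores(np.random.randn(1, 4, 8), np.random.randn(1, 3, 8))
print(related.shape)  # (1, 4, 3): decoding step t vs. knowledge phrase m
```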
As shown in fig. 6, the system can implement the end-to-end speech recognition model on a pure Transformer architecture. The ASR model contains an unmodified language model; no special-symbol concept exists in the language model and no special symbol is introduced, so the training data is unchanged and the effect of the language model is not affected. Meanwhile, without modifying the language model, the prior knowledge in the user's personalized language knowledge base is introduced in a lightweight way through the pointer scoring model, restricting the model's search to a suitable range; and this capability is implemented end to end on a fully neural architecture. Better decoding is therefore achieved in a lightweight manner.
In specific implementation, the prior knowledge could instead be introduced by adding a corresponding bias on top of the model shown in fig. 5, using a special symbol (such as "$"). That approach, however, changes the training data of the language model, so "$" may randomly appear in the recognized sentences.
In a specific implementation, the models shown in fig. 5 or fig. 6 are not based on a special symbol; the prior knowledge is introduced through the pointer scoring model instead. The second text feature is determined by the pointer scoring model from the language knowledge in the external memory combined with the first text feature; for example, the first text feature is the feature of "opening the soil rabbit", the knowledge base contains "rabbit lamp", and the second text feature is the feature of "opening a rabbit". To guarantee that the speech recognition model actually reads information from the external memory even when the main model (including the language model) is already well trained, the loss of the language knowledge encoder (biasEncoder) is introduced into the loss of the whole decoder in the design of the loss value.
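A small sketch of this loss design follows, under the assumption that the biasEncoder term is simply weighted and added to the decoder cross-entropy; the weighting and the form of the bias targets are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(dec_logits, targets, bias_logits, bias_targets, lam=0.3):
    # Main decoder loss: cross-entropy over the output text sequence.
    dec_loss = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    # biasEncoder loss: forces the model to really read the external memory,
    # e.g. by predicting which memory phrase (if any) each step should copy.
    bias_loss = F.cross_entropy(bias_logits.transpose(1, 2), bias_targets)
    return dec_loss + lam * bias_loss

loss = total_loss(torch.randn(2, 10, 4000), torch.randint(0, 4000, (2, 10)),
                  torch.randn(2, 10, 7), torch.randint(0, 7, (2, 10)))
```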
In this embodiment, the server is further configured to encode the language knowledge through the language knowledge encoder included in the speech recognition model, and to store the encoded language knowledge in a language knowledge storage module (that is, the external memory) included in the speech recognition model; it is specifically configured to determine, through the pointer scoring model, the second text feature according to the encoded data stored in the language knowledge storage module and the first text feature.
After determining the text sequence corresponding to the voice data, the server can execute the voice interaction processing according to the text sequence. For example, in a question-answering scene (e.g. "you guess me"), the reply can be determined from the information provided by the user and sent to the smart speaker for presentation, such as the reply "horns"; in an on-demand scene, the song data requested by the user, such as "memorial", can be sent to the smart speaker for playing; and in a call scene, the smart speaker can connect the user with the target communication device, for example dialing the mobile phone of "catalpa hao".
The above describes the processing in the prediction phase of the end-to-end speech recognition model; the processing in the model's training phase is described below.
In the training phase of the end-to-end speech recognition model, the server is used to determine a training data set and to learn the model from it. Each item of training data may include voice data, a personalized language knowledge base and text sequence annotation information, and may further include information such as whether a given item of personalized language knowledge takes effect. During training, the acoustic features of the voice data serve as the model's input data and the text sequence annotation as its output data; the M matrix of the pointer scoring model is built from the personalized language knowledge base, and the model is learned from the training data set by machine learning.
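One illustrative training step under these conventions is sketched below, reusing the PointerBiasedASR sketch given earlier; the batching, tokenization and optimizer settings are assumptions for the illustration.

```python
import torch

# One training step: acoustic features in, annotated text out, with this
# user's knowledge phrases encoded as the M entries of the external memory.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

fbank = torch.randn(2, 50, 80)                 # acoustic features (input)
targets = torch.randint(0, 4000, (2, 50))      # text sequence annotation
phrase_embs = torch.randn(2, 6, 128)           # per-user knowledge entries (M=6)

logits = model(fbank, phrase_embs)
loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```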
In one example, the personalized language knowledge base in the training data is determined as follows: the personalized language knowledge base is constructed automatically from the text sequence annotations of the training data. With this processing mode, a large amount of training data can be generated without collecting additional data, so the construction efficiency of the end-to-end speech recognition model is effectively improved.
In specific implementation, the server also constructs the language knowledge encoder. The input data of the language knowledge encoder may be the texts of a group of phrases, and the output data is the vectorized encoding of the corresponding phrases; these encodings carry information about the words and help the decoder module output the text accurately, such as "lover knot", "open rabbit" and "catalpa hao". In this embodiment, the training data consists of speech and its corresponding text, so text segments can be randomly intercepted from the text as phrases and used as the input of the language knowledge encoder; combined with the corresponding speech signal, these phrases help the decoder decode the text more accurately. During training, the decoder learns how to select relevant knowledge from the knowledge base to assist decoding; the whole training process is end to end and requires no additional training data.
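The random text-segment interception can be sketched as follows; the segment lengths and counts are illustrative assumptions.

```python
import random

def sample_bias_phrases(reference_text, num_phrases=3, min_len=2, max_len=4):
    """Randomly cut short segments from a reference transcript to use as
    the training-time input of the language knowledge encoder."""
    phrases = []
    for _ in range(num_phrases):
        n = random.randint(min_len, min(max_len, len(reference_text)))
        i = random.randint(0, len(reference_text) - n)
        phrases.append(reference_text[i:i + n])
    return phrases

print(sample_bias_phrases("open the rabbit lamp"))
```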
As can be seen from the above embodiments, in the voice interaction system provided by the embodiments of the present application, the smart speaker collects the voice data of the target user and sends it to the server; the server generates, for each user, a personalized language knowledge base including language knowledge of at least one speaker service field, determines the text sequence corresponding to the voice data through the end-to-end speech recognition model and the target user's language knowledge base, and executes the voice interaction processing according to the text sequence. This processing mode achieves at least the following technical effects:
1) For the problem in the traditional technology that the sub-language model must be retrained whenever the language knowledge is updated, which costs time and harms the real-time performance of speech recognition, the system provided by the present application can update the content of the knowledge base in real time and perform speech recognition according to the knowledge base updated in real time, so high real-time performance and high recognition accuracy can both be achieved.
2) For the sentence-pattern dependence problem in the traditional technology, such as depending on the pattern "call XX" (e.g. "call Zheng catalpa hao"), the system provided by the present application is built on a language knowledge base composed of vocabulary and does not depend on sentence patterns, and vocabulary from one speaker skill can be applied to the speech recognition of other speaker skills, so the accuracy of speech recognition is effectively improved.
3) The model relied on by the system provided by the present application is an end-to-end model, which avoids the error propagation problem of traditional non-end-to-end models, so the accuracy of speech recognition is effectively improved.
4) For the problem in the traditional technology that a sub-language model is effective only for its corresponding speaker skill, the system provided by the present application is built on a language knowledge base composed of words; the knowledge base can contain the personalized language knowledge of multiple speaker skills, and one speaker skill can draw on the personalized language knowledge of other speaker skills, so the accuracy of speech recognition is effectively improved.
Second embodiment
In the foregoing embodiments, a voice interaction system is provided, and correspondingly, the present application also provides a voice interaction method, where an execution subject of the method may be a server, or any device capable of implementing the method. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
In this embodiment, the method may include the steps of:
Step 1: constructing, for each user, a personalized language knowledge base including language knowledge of at least one speaker service field.
The personalized language knowledge base includes, but is not limited to: long-tail entity words, anti-language-model entity words, homophone entity words with different written forms, and entity words in context.
In one example, the speaker service domain includes the call service domain, whose language knowledge includes the names in the user's address book; in this case, step 1 may comprise the following sub-steps: 1.1) receiving the user address book information sent by the smart speaker corresponding to the user; 1.2) taking the names in the user's address book as the personalized language knowledge of the user.
In one example, the speaker service domain includes the question-answering service domain, whose language knowledge includes text segments in the dialogue context; in this case, step 1 may comprise the following sub-steps: 1.3) determining the context text sequence; 1.4) taking text segments in the context text sequence as the personalized language knowledge of the user.
In one example, the speaker service domain includes the multimedia playing service domain, whose language knowledge includes song names; in this case, step 1 may comprise the following sub-steps: 1.5) determining the names of the programs historically played by the user; 1.6) taking those historical program names as the personalized language knowledge of the user.
In one example, step 1 may include at least one of the following: 1.7) determining the user's personalized language knowledge from the user's shopping data; 1.8) determining the user's personalized language knowledge from text information entered by the user.
In one example, the method may further comprise the step of updating the user's language knowledge base according to the interaction voice data.
Step 2: and aiming at the voice data of the target user sent by the intelligent sound box, determining a text sequence corresponding to the voice data through an end-to-end voice recognition model and the language knowledge base of the target user.
In one example, where the speech recognition model includes a language model, step 2 may comprise the following sub-steps: 2.1) determining, through the language model, the first text feature corresponding to the voice data; 2.2) determining the second text feature corresponding to the voice data according to the language knowledge base and the first text feature; 2.3) determining the text sequence at least according to the second text feature.
In one example, where the speech recognition model includes a pointer scoring model, step 2.2 may be implemented as follows: determining, through the pointer scoring model, the second text feature corresponding to the voice data according to the language knowledge base and the first text feature.
In one example, step 2.2 may comprise the following sub-steps: 2.2.1) determining, through the pointer scoring model, the relevance between the words corresponding to the first text feature and each item of language knowledge; 2.2.2) determining the second text feature at least according to the language knowledge whose relevance is greater than the relevance threshold.
In one example, the speech recognition model includes a language knowledge encoder; accordingly, the method may further comprise: encoding the language knowledge through the language knowledge encoder, and storing the encoded language knowledge in the language knowledge storage module included in the speech recognition model. Step 2.2 can then be implemented as follows: determining the second text feature according to the encoded data stored in the language knowledge storage module and the first text feature.
Step 3: executing the voice interaction processing according to the text sequence.
In one example, the method may further comprise: determining a training data set, where the training data includes voice data, a personalized language knowledge base and text sequence annotation information; and learning the speech recognition model from the training data set.
In one example, the personalized language knowledge base in the training data may be determined as follows: constructing the personalized language knowledge base according to the text sequence annotations of the training data.
As can be seen from the above embodiments, the voice interaction method provided by the embodiments of the present application constructs, for each user, a personalized language knowledge base including language knowledge of at least one speaker service field; determines, for the voice data of the target user sent by the smart speaker, the text sequence corresponding to the voice data through the end-to-end speech recognition model and the target user's language knowledge base; and executes the voice interaction processing according to the text sequence. This processing mode achieves at least the following technical effects:
1) For the problem in the traditional technology that the sub-language model must be retrained whenever the language knowledge is updated, which costs time and harms the real-time performance of speech recognition, the method provided by the present application can update the content of the knowledge base in real time and perform speech recognition according to the knowledge base updated in real time, so high real-time performance and high recognition accuracy can both be achieved.
2) For the sentence-pattern dependence problem in the traditional technology, such as depending on the pattern "call XX" (e.g. "call Zheng catalpa hao"), the method provided by the present application is built on a language knowledge base composed of vocabulary and does not depend on sentence patterns, and vocabulary from one speaker skill can be applied to the speech recognition of other speaker skills, so the accuracy of speech recognition is effectively improved.
3) The model relied on by the method is an end-to-end model, which avoids the error propagation problem of traditional non-end-to-end models, so the accuracy of speech recognition is effectively improved.
4) For the problem in the traditional technology that a sub-language model is effective only for its corresponding speaker skill, the method provided by the present application is built on a language knowledge base composed of words; the knowledge base can contain the personalized language knowledge of multiple speaker skills, and one speaker skill can draw on the personalized language knowledge of other speaker skills, so the accuracy of speech recognition is effectively improved.
Third embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides a voice interaction apparatus. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The voice interaction apparatus provided by the present application includes:
a knowledge base construction unit, configured to construct, for each user, a personalized language knowledge base including language knowledge of at least one speaker service field;
a voice recognition unit, configured to determine, for the voice data of the target user sent by the smart speaker, the text sequence corresponding to the voice data through the end-to-end speech recognition model and the language knowledge base of the target user;
an instruction processing unit, configured to execute the voice interaction processing according to the text sequence.
Fourth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of this embodiment includes: a processor and a memory; the memory stores a program implementing the voice interaction method; after the device is powered on and the program of the method is run by the processor, the following steps are performed: constructing, for each user, a personalized language knowledge base including language knowledge of at least one speaker service field; determining, for the voice data of the target user sent by the smart speaker, the text sequence corresponding to the voice data through the end-to-end speech recognition model and the language knowledge base of the target user; and executing the voice interaction processing according to the text sequence.
Fifth embodiment
In the foregoing embodiment, a voice interaction system is provided, and correspondingly, the present application also provides a voice interaction method, where an execution subject of the method may be a smart speaker, a smart television, a chat robot, or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The voice interaction method provided by the present application may comprise the following steps:
Step 1: collecting the voice data of the target user;
Step 2: sending the voice data to the server side, so that the server side generates, for each user, a personalized language knowledge base including language knowledge of at least one speaker service field; determines the text sequence corresponding to the voice data through the end-to-end speech recognition model and the language knowledge base of the target user; and executes the voice interaction processing according to the text sequence.
Sixth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides a voice interaction apparatus. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The voice interaction apparatus provided by the present application includes:
a voice data collection unit, configured to collect the voice data of the target user;
a voice data sending unit, configured to send the voice data to the server side, so that the server side generates, for each user, a personalized language knowledge base including language knowledge of at least one speaker service field; determines the text sequence corresponding to the voice data through the end-to-end speech recognition model and the language knowledge base of the target user; and executes the voice interaction processing according to the text sequence.
Seventh embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of this embodiment includes: a processor and a memory; the memory stores a program implementing the voice interaction method; after the device is powered on and the program of the method is run by the processor, the following steps are performed: collecting the voice data of the target user; sending the voice data to the server side, so that the server side generates, for each user, a personalized language knowledge base including language knowledge of at least one service field; determines the text sequence corresponding to the voice data through the end-to-end speech recognition model and the language knowledge base of the target user; and executes the voice interaction processing according to the text sequence.
The electronic device can be a smart sound box, a smart television, a chat robot and the like.
In one example, the electronic device is a smart speaker, and the server generates, for each user, a personalized language knowledge base including language knowledge of at least one speaker service field.
Eighth embodiment
In the foregoing embodiments, a voice interaction system is provided; correspondingly, the present application also provides a speech recognition model construction method, whose execution subject may be a server or the like. The method corresponds to the system embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The speech recognition model construction method provided by the present application may comprise the following steps:
Step 1: determining a training data set.
The training data includes, but is not limited to: voice data, a personalized language knowledge base and text sequence annotation information.
Step 2: constructing the network structure of the end-to-end speech recognition model.
As shown in fig. 4, in one example, the model may employ the CLAS network structure, including a speech encoder, a decoder and a language knowledge encoder, and may further include a classifier.
As shown in fig. 5, in one example, the model may include: a speech encoder, a decoder, a language knowledge encoder, a language model, a feature fusion module and a classifier.
As shown in fig. 6, in one example, the model may include: a speech encoder, a language model, a language knowledge encoder, a pointer scoring model, a feature fusion module and a classifier.
Step 3: learning the speech recognition model from the training data set.
As can be seen from the foregoing embodiments, the speech recognition model construction method provided by the embodiments of the present application determines a training data set, where the training data includes voice data, a personalized language knowledge base and text sequence annotation information; constructs the network structure of an end-to-end speech recognition model; and learns the model from the training data set. This processing mode allows the model to introduce the user's personalized prior language knowledge, so model accuracy is effectively improved. Meanwhile, the model is independent of sentence patterns, so the extensibility of speech recognition is effectively improved. In addition, the model can perform speech recognition according to a language knowledge base updated in real time, so the real-time performance of speech recognition is effectively improved; it can perform speech recognition for multiple speaker skills, further improving extensibility; and, being end to end, it has no error propagation problem, so the accuracy of speech recognition is effectively improved.
Ninth embodiment
In the foregoing embodiment, a speech recognition model construction method is provided, and correspondingly, the present application also provides a speech recognition model construction device. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The speech recognition model construction apparatus provided by the present application includes:
a data preparation unit, configured to determine a training data set, where the training data includes voice data, a personalized language knowledge base and text sequence annotation information;
a network construction unit, configured to construct the network structure of an end-to-end speech recognition model;
a network training unit, configured to learn the speech recognition model from the training data set.
Tenth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of this embodiment includes: a processor and a memory; the memory stores a program implementing the speech recognition model construction method; after the device is powered on and the program of the method is run by the processor, the following steps are performed: determining a training data set, where the training data includes voice data, a personalized language knowledge base and text sequence annotation information; constructing the network structure of an end-to-end speech recognition model; and learning the speech recognition model from the training data set.
Eleventh embodiment
In the foregoing embodiment, a voice interaction system is provided, and correspondingly, the present application also provides a voice interaction method, where an execution subject of the method may be a smart speaker or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The voice interaction method provided by the application comprises the following steps:
Step 1: determining the user's personalized language knowledge base including language knowledge of at least one domain;
Step 2: for the collected voice data of the user, determining the text sequence corresponding to the voice data through the end-to-end speech recognition model and the language knowledge base;
Step 3: executing the voice interaction processing according to the text sequence.
The method provided by this embodiment differs from the system of the first embodiment in that it can run on the smart speaker terminal itself: the speech recognition processing is completed on the terminal, without requiring the cooperation of a server side.
Twelfth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides a voice interaction apparatus. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The voice interaction apparatus provided by the present application includes:
a knowledge base construction unit, configured to determine the user's personalized language knowledge base including language knowledge of at least one domain;
a voice recognition unit, configured to determine, for the collected voice data of the user, the text sequence corresponding to the voice data through the end-to-end speech recognition model and the language knowledge base;
an instruction processing unit, configured to execute the voice interaction processing according to the text sequence.
Thirteenth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of this embodiment includes: a processor and a memory; the memory stores a program implementing the voice interaction method; after the device is powered on and the program of the method is run by the processor, the following steps are performed: determining the user's personalized language knowledge base including language knowledge of at least one domain; determining, for the collected voice data of the user, the text sequence corresponding to the voice data through the end-to-end speech recognition model and the language knowledge base; and executing the voice interaction processing according to the text sequence.
Fourteenth embodiment
In the foregoing embodiment, a voice interaction system is provided, and correspondingly, the present application further provides a television program playing method, where the implementation subject of the method includes but is not limited to: smart televisions, television remote controllers, and the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The television program playing method provided by the application comprises the following steps:
Step 1: determining the user's personalized program-playing language knowledge base.
The language knowledge base includes, but is not limited to: television program names, channel names, actor names and other vocabulary related to television program playing.
In one example, the method may further comprise: taking the program names, actor names and/or director names of the user's historically played program objects as the user's personalized program-playing language knowledge. In this manner, the vocabulary in the language knowledge base can include vocabulary associated with the television programs the user has historically watched.
Step 2: and aiming at the collected program playing voice instruction data of the user, determining a target program name corresponding to the voice instruction data through an end-to-end voice recognition model and the language knowledge base.
The smart television can collect the user's program-playing voice instruction data through a remote controller or another device. For example, the user issues the voice instruction "I want to watch Harry Potter", and step 2 recognizes the program name "Harry Potter". For the specific implementation of the speech recognition in this step, refer to the relevant description of the first embodiment, which is not repeated here.
After the target program name is determined, the next step of playing the related television program can be performed.
Step 3: playing the target program object according to the target program name.
In one example, step 3 may include the following sub-steps: 3.1) determining the television channel and the playing time related to the target program name according to the program listings of each television channel (for example the listings of the past week, which may include both the programs played in the past week and the programs currently playing); 3.2) determining the target program object according to the playing time and the television channel; 3.3) playing the target program object.
For example, with cable television in a certain area, the user says "I want to watch Harry Potter" to the remote controller. The remote controller first recognizes that the program the user wants to order or review is named "Harry Potter"; then, according to the television program listings reviewable within one week, it searches which channel played Harry Potter and when; if found, the reviewable program object, or the program object currently being played on the related channel, can be played.
The following table shows a program listing in this embodiment.

Playing time         | Television channel | Program name
2020/6/1 19:00-19:30 | CCTV-1             | News Simulcast
2020/6/1 20:00-22:30 | CCTV-6             | Harry Potter 1
2020/6/3 20:00-22:30 | CCTV-6             | Harry Potter 2
As shown in the table above, the target program name determined in step 2 may correspond to multiple program objects; for example, Harry Potter is played on both June 1 and June 3. In this case, step 3.2 may further comprise the following sub-steps: 3.2.1) displaying the multiple program objects played at multiple times on at least one television channel corresponding to the target program name; 3.2.2) taking the program object specified by the user as the target program object. With this processing mode, all related program objects can be displayed on the television screen for the user to select, and the target program object specified by the user is then played. For example, if multiple reviewable programs are found, such as many episodes of a television series, the reviewable episodes are displayed so that the user can select the episode he or she wants to watch.
In specific implementation, step 3 may further comprise the following sub-steps: if the program listing does not include the target program name, determining program names related to the target program name; displaying the related program names; and if the user specifies a related program object to play, playing that related program object. With this processing mode, if the program the user asked for is not found, related television programs can be recommended; for example, if the user wants to watch the documentary "West Lake Impression" but it was not played in the past week, other programs related to the West Lake, such as "decryption of the temple", "Ten Scenes of the West Lake" and "the scene boat", can be played.
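The lookup flow of steps 3.1 to 3.3, including the multiple-match and related-program cases, can be sketched as follows; the guide format, the related-name rule and all function names are illustrative assumptions.

```python
# Program guide entries: (play_time, channel, program_name), as in the table.
GUIDE = [
    ("2020/6/1 19:00-19:30", "CCTV-1", "News Simulcast"),
    ("2020/6/1 20:00-22:30", "CCTV-6", "Harry Potter 1"),
    ("2020/6/3 20:00-22:30", "CCTV-6", "Harry Potter 2"),
]

def find_programs(target_name, guide=GUIDE):
    # 3.1) Find every channel/time whose program name matches the target.
    return [e for e in guide if target_name.lower() in e[2].lower()]

def related_programs(target_name, guide=GUIDE):
    # Fallback: recommend entries sharing any word with the target name.
    words = set(target_name.lower().split())
    return [e for e in guide if words & set(e[2].lower().split())]

matches = find_programs("Harry Potter")
if len(matches) > 1:
    print("Please choose an episode:", [m[2] for m in matches])  # 3.2.1
elif not matches:
    print("Related programs:", related_programs("Harry Potter"))
else:
    print("Playing", matches[0][2])                              # 3.3
```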
As can be seen from the foregoing embodiments, in the television program playing method provided by the embodiments of the present application, the smart television determines the user's personalized program-playing language knowledge base; for the collected program-playing voice instruction data of the user, it determines the target program name corresponding to the voice instruction data through the end-to-end speech recognition model and the language knowledge base; and it plays the target program object according to the target program name. This processing mode can update the content of the program on-demand knowledge base in real time and perform on-demand speech recognition according to the knowledge base updated in real time, so high real-time performance and high recognition accuracy can both be achieved. Meanwhile, the model relied on by the method is an end-to-end model, which avoids the error propagation problem of non-end-to-end models, so the accuracy of speech recognition is effectively improved.
Fifteenth embodiment
In the foregoing embodiment, a method for playing a television program is provided, and correspondingly, a device for playing a television program is also provided. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The television program playing apparatus provided by the present application includes:
a knowledge base determination unit, configured to determine, on the smart television, the user's personalized program-playing language knowledge base;
a program recognition unit, configured to determine, for the collected program-playing voice instruction data of the user, the target program name corresponding to the voice instruction data through the end-to-end speech recognition model and the language knowledge base;
a program playing unit, configured to play the target program object according to the target program name.
Sixteenth embodiment
The application also provides an intelligent television. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The smart television of this embodiment includes: a processor and a memory; the memory stores a program implementing the television program playing method; after the device is powered on and the program of the method is run by the processor, the following steps are performed: determining the user's personalized program-playing language knowledge base; determining, for the collected program-playing voice instruction data of the user, the target program name corresponding to the voice instruction data through the end-to-end speech recognition model and the language knowledge base; and playing the target program object according to the target program name.
Seventeenth embodiment
In the foregoing embodiment, a television program playing method is provided, and correspondingly, the application further provides a television program playing method, and an execution main body of the method may be an intelligent television or the like. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the parts of the fourteenth embodiment will not be described again, please refer to corresponding parts in the fourteenth embodiment.
The television program playing method provided by the application comprises the following steps:
Step 1: determining the target program name corresponding to the user's program-playing voice instruction data collected by the smart television;
Step 2: determining the target program object corresponding to the target program name according to the program listing;
Step 3: playing the target program object.
In one example, step 2 may include the following sub-steps: 2.1) determining the historical target program object corresponding to the target program name according to the historical program listings of each television channel; 2.2) determining the current target program object corresponding to the target program name according to the current program listings of each television channel.
As can be seen from the foregoing embodiments, in the television program playing method provided by the embodiments of the present application, the smart television determines the target program name corresponding to the program-playing voice instruction data, determines the target program object corresponding to the target program name according to the program listing, and plays the target program object. This processing mode can automatically play the corresponding program object according to the user's program-playing voice instruction, sparing the user from switching channels one by one and searching manually for television programs of interest; user experience is therefore effectively improved and the computing resources of the program server side are saved.
Eighteenth embodiment
In the foregoing embodiment, a method for playing a television program is provided, and correspondingly, a device for playing a television program is also provided. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The television program playing apparatus provided by the present application includes:
a program name recognition unit, configured to determine the target program name corresponding to the user's program-playing voice instruction data collected by the smart television;
a program object determination unit, configured to determine the target program object corresponding to the target program name according to the program listing;
a playing unit, configured to play the target program object.
Nineteenth embodiment
The application also provides an intelligent television. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The smart television of this embodiment includes: a processor and a memory; the memory stores a program implementing the television program playing method; after the device is powered on and the program of the method is run by the processor, the following steps are performed: determining the target program name corresponding to the user's program-playing voice instruction data collected by the smart television; determining the target program object corresponding to the target program name according to the program listing; and playing the target program object.
Twentieth embodiment
In the foregoing embodiments, a television program playing method is provided; correspondingly, the present application further provides a television program playing method whose execution subject may be a server or the like. The method corresponds to the system embodiment described above. Parts of this embodiment that are the same as the seventeenth embodiment are not described again; please refer to the corresponding parts of the seventeenth embodiment.
The television program playing method provided by the application comprises the following steps:
Step 1: for the user's program-playing voice instruction data collected by the smart television, the server side determines the target program name corresponding to the voice instruction data;
Step 2: determining the target program object corresponding to the target program name according to the program listing;
Step 3: playing the target program object through the smart television.
In specific implementation, the video stream of the target program object may be sent to the smart television for playing.
As can be seen from the foregoing embodiments, in the television program playing method provided by the embodiments of the present application, the server determines, for the user's program-playing voice instruction data collected by the smart television, the target program name corresponding to the voice instruction data; determines the target program object corresponding to the target program name according to the program listing; and plays the target program object through the smart television. This processing mode can automatically play the corresponding program object according to the user's program-playing voice instruction, sparing the user from switching channels one by one and searching manually for television programs of interest; user experience is therefore effectively improved and the computing resources of the program server side are saved.
Twenty-first embodiment
In the foregoing embodiment, a method for playing a television program is provided, and correspondingly, a device for playing a television program is also provided. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The television program playing apparatus provided by the present application includes:
a program name recognition unit, configured to determine, on the server side, the target program name corresponding to the user's program-playing voice instruction data collected by the smart television;
a program object determination unit, configured to determine the target program object corresponding to the target program name according to the program listing;
a playing unit, configured to play the target program object through the smart television.
Twenty-second embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of this embodiment includes: a processor and a memory; the memory stores a program implementing the television program playing method; after the device is powered on and the program of the method is run by the processor, the following steps are performed: determining, for the user's program-playing voice instruction data collected by the smart television, the target program name corresponding to the voice instruction data; determining the target program object corresponding to the target program name according to the program listing; and playing the target program object through the smart television.
Twenty-third embodiment
In the foregoing embodiment, a television program playing method is provided; correspondingly, the present application further provides a television program playing method whose execution subject may be a smart television or the like. The method corresponds to the system embodiment described above. Parts of this embodiment that are the same as the seventeenth embodiment are not repeated here; please refer to the corresponding parts of the seventeenth embodiment.
The television program playing method provided by the application comprises the following steps:
Step 1: the smart television collects program playing voice instruction data of a user;
Step 2: sending the voice instruction data to a server, so that the server can determine a target program name corresponding to the voice instruction data and determine a target program object corresponding to the target program name according to a program list;
Step 3: playing the target program object.
In specific implementation, the smart television can receive the video stream of the target program object sent by the server.
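A minimal client-side sketch of these steps, in Python, is given below; the HTTP endpoint path and the JSON field name are assumptions made for illustration, not an interface defined by this application.

    import json
    import urllib.request

    def play_by_voice(voice_instruction_data: bytes, server_url: str) -> str:
        # Step 2: send the collected voice instruction data to the server, which
        # determines the target program name and resolves it against the program list.
        request = urllib.request.Request(
            url=f"{server_url}/play-instruction",   # hypothetical endpoint
            data=voice_instruction_data,
            headers={"Content-Type": "application/octet-stream"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            result = json.load(response)
        # Step 3: hand the returned video stream to the television's player component.
        return result["video_stream_url"]           # hypothetical field name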
As can be seen from the foregoing embodiments, in the television program playing method provided in the embodiments of the present application, the smart television collects program playing voice instruction data of a user; sends the voice instruction data to a server, so that the server determines a target program name corresponding to the voice instruction data and determines a target program object corresponding to the target program name according to a program list; and plays the target program object. This processing mode can automatically play the corresponding program object according to the user's program playing voice instruction, sparing the user from switching channels one by one and searching manually for television programs of interest; it can therefore effectively improve the user experience and save computing resources on the program server side.
Twenty-fourth embodiment
In the foregoing embodiment, a method for playing a television program is provided; correspondingly, a device for playing a television program is also provided. The apparatus corresponds to the method embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The television program playing device provided by the application includes:
the voice instruction acquisition unit is used for collecting, at the smart television, program playing voice instruction data of a user;
the voice instruction sending unit is used for sending the voice instruction data to the server, so that the server can determine a target program name corresponding to the voice instruction data and determine a target program object corresponding to the target program name according to a program list;
and the playing unit is used for playing the target program object.
Twenty-fifth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for implementing the television program playing method, wherein the following steps are executed after the device is powered on and the program of the method is run by the processor: collecting program playing voice instruction data of a user; sending the voice instruction data to a server, so that the server determines a target program name corresponding to the voice instruction data and determines a target program object corresponding to the target program name according to a program list; and playing the target program object.
Twenty-sixth embodiment
In the foregoing embodiment, a voice interaction method is provided; correspondingly, the present application further provides a conference recording method, where the execution subject of the method may be an electronic device deployed at a conference site, such as a court trial all-in-one machine. The method corresponds to the system embodiment described above. Parts of this embodiment that are the same as the second embodiment are not described again; please refer to the corresponding parts of the second embodiment.
The conference recording method provided by the application comprises the following steps:
Step 1: constructing a language knowledge base of the conference domain.
The conference domain can be any of various application domains, such as the computer domain, the medical domain, the legal domain, or the patent domain. In specific implementation, language knowledge bases for a plurality of conference domains can be constructed, such as a language knowledge base for the computer domain, a language knowledge base for the medical domain, a language knowledge base for the legal domain, a language knowledge base for the patent domain, and the like.
In specific implementation, for a given conference domain, the language knowledge of that domain can be determined from various text data and multimedia data of the domain to form the corresponding language knowledge base.
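As a rough illustration, the Python sketch below builds a per-domain language knowledge base as a set of domain terms mined from domain text data; the naive tokenizer and frequency filter are assumed stand-ins for whatever term-mining method an implementation actually uses.

    import re
    from collections import Counter

    def build_language_knowledge_base(domain_texts: list[str],
                                      min_count: int = 2) -> set[str]:
        # Count token occurrences across the domain's text data; real term
        # mining (e.g., extracting technical terms) would replace this.
        counts: Counter = Counter()
        for text in domain_texts:
            counts.update(re.findall(r"\w+", text.lower()))
        # Keep tokens frequent enough to be treated as domain language knowledge.
        return {term for term, n in counts.items() if n >= min_count}

    # One knowledge base per conference domain, built from hypothetical corpora.
    legal_corpus = ["the defendant filed a motion", "the defendant appealed"]
    knowledge_bases = {"legal": build_language_knowledge_base(legal_corpus)}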
Step 2: collecting conference voice data.
Taking a court trial as an example, the voice data of each person in the court trial process can be collected through the microphone connected to the court trial all-in-one machine.
In one example, the method may further comprise the following step: determining the conference domain corresponding to the conference voice data. In specific implementation, a user may specify the conference domain, for example, when starting the conference recording; the conference domain may also be determined automatically by other means, as in the sketch below.
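One possible form of such other means is sketched below in Python: an early fragment of a rough transcript is scored against each domain's knowledge base by term overlap. This is purely illustrative, not a determination method prescribed by this application.

    def determine_conference_domain(initial_transcript: str,
                                    knowledge_bases: dict,
                                    default: str = "general") -> str:
        words = set(initial_transcript.lower().split())
        # Score each domain by how many of its knowledge terms appear.
        scores = {domain: len(words & terms)
                  for domain, terms in knowledge_bases.items()}
        best = max(scores, key=scores.get, default=default)
        return best if scores.get(best, 0) > 0 else default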
Step 3: determining a text sequence corresponding to the conference voice data through an end-to-end speech recognition model and the language knowledge base of the conference domain to form a conference record.
The end-to-end speech recognition model may be a model learned from a corpus outside the conference domain and may not include language knowledge of the conference domain. To accurately recognize professional language (technical terms and the like) in conference speech, the model needs to be combined with the language knowledge base of the conference domain when performing speech recognition on conference voice data of the corresponding domain to form the conference record.
The end-to-end speech recognition model may be the model shown in fig. 4, 5 and 6, which includes a language knowledge encoder. In specific implementation, the end-to-end speech recognition model may be learned from a corpus by a server; the server sends the model to each terminal device (e.g., a court trial all-in-one machine); each terminal device holds the language knowledge base of its related domain and determines, using the speech recognition model combined with that language knowledge base, the text sequence corresponding to the conference voice data to form the conference record.
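The following numpy sketch illustrates, under assumed vector shapes and an assumed dot-product scorer, the general spirit of combining encoded language knowledge with a text feature: knowledge entries whose relevance to the first text feature exceeds a threshold contribute a second text feature that is fused with the first before classification. It is a sketch, not the application's exact model.

    import numpy as np

    def knowledge_biased_feature(first_text_feature: np.ndarray,   # shape (d,)
                                 knowledge_encodings: np.ndarray,  # shape (k, d)
                                 relevance_threshold: float = 0.5) -> np.ndarray:
        # Score the relevance of each encoded knowledge entry to the feature.
        scores = knowledge_encodings @ first_text_feature          # shape (k,)
        relevance = 1.0 / (1.0 + np.exp(-scores))                  # squash to (0, 1)
        mask = relevance > relevance_threshold
        if not mask.any():
            return first_text_feature                              # no knowledge applies
        # Relevance-weighted mix of the relevant entries: a second text feature.
        weights = relevance[mask] / relevance[mask].sum()
        second_text_feature = weights @ knowledge_encodings[mask]  # shape (d,)
        # Feature fusion: a simple sum here; a trained module may do this instead.
        return first_text_feature + second_text_feature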
As can be seen from the foregoing embodiments, the conference recording method provided in the embodiments of the present application constructs a language knowledge base of the conference domain; collects conference voice data; and determines a text sequence corresponding to the conference voice data through an end-to-end speech recognition model and the language knowledge base of the conference domain to form a conference record. This processing mode performs conference speech recognition according to the language knowledge base of the conference domain, avoiding the problem in the traditional technology that the sub-language model of each conference domain must be retrained whenever language knowledge is updated, which takes much time and harms the real-time performance of speech recognition; both high real-time performance and high accuracy of speech recognition can therefore be achieved. Meanwhile, since the model relied on by the method is an end-to-end model, the error propagation problem of non-end-to-end models is avoided, which can effectively improve the accuracy of speech recognition.
Twenty-seventh embodiment
In the foregoing embodiment, a conference recording method is provided; correspondingly, the present application further provides a conference recording apparatus. The apparatus corresponds to the method embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The conference recording apparatus provided by the application includes:
the knowledge base construction unit is used for constructing a language knowledge base of the conference field;
the voice data acquisition unit is used for acquiring conference voice data;
and the voice transcription unit is used for determining a text sequence corresponding to the conference voice data through an end-to-end voice recognition model and a language knowledge base in the conference field to form a conference record.
Twenty-eighth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for implementing the conference recording method, and the device performs the following steps after being powered on and running the program of the method through the processor: constructing a language knowledge base of the conference domain; collecting conference voice data; and determining a text sequence corresponding to the conference voice data through an end-to-end speech recognition model and the language knowledge base of the conference domain to form a conference record.
Twenty-ninth embodiment
In the foregoing embodiment, a voice interaction method is provided; correspondingly, the present application further provides a conference recording method, where the execution subject of the method may be a server or the like. The method corresponds to the system embodiment described above. Parts of this embodiment that are the same as the second embodiment are not described again; please refer to the corresponding parts of the second embodiment.
The conference recording method provided by the application comprises the following steps:
Step 1: constructing a language knowledge base for each domain.
Step 2: for voice data of a target conference sent by a terminal device, determining the domain to which the target conference belongs.
Step 3: determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of the target conference's domain to form a conference record of the target conference.
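A hedged server-side sketch of steps 1 to 3 follows, reusing determine_conference_domain from the sketch in the twenty-sixth embodiment; rough_transcribe and transcribe_with_knowledge are hypothetical stand-ins for a knowledge-free first pass and for the end-to-end recognizer combined with a domain language knowledge base.

    from typing import Callable

    def record_conference(
        voice_data: bytes,
        knowledge_bases: dict,                                      # step 1: per-domain bases
        rough_transcribe: Callable[[bytes], str],
        transcribe_with_knowledge: Callable[[bytes, set], str],
    ) -> str:
        # Step 2: determine the domain to which the target conference belongs.
        domain = determine_conference_domain(rough_transcribe(voice_data),
                                             knowledge_bases)
        # Step 3: transcribe with that domain's knowledge base to form the record.
        return transcribe_with_knowledge(voice_data,
                                         knowledge_bases.get(domain, set()))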
As can be seen from the foregoing embodiments, the conference recording method provided in the embodiments of the present application constructs a language knowledge base for each domain; determines, for voice data of a target conference sent by a terminal device, the domain to which the target conference belongs; and determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of that domain to form a conference record of the target conference. This processing mode performs conference speech recognition according to the language knowledge base of the conference domain, avoiding the problem in the traditional technology that the sub-language model of each conference domain must be retrained whenever language knowledge is updated, which takes much time and harms the real-time performance of speech recognition; both high real-time performance and high accuracy of speech recognition can therefore be achieved. Meanwhile, since the model relied on by the method is an end-to-end model, the error propagation problem of non-end-to-end models is avoided, which can effectively improve the accuracy of speech recognition.
Thirtieth embodiment
In the foregoing embodiment, a conference recording method is provided; correspondingly, the present application further provides a conference recording apparatus. The apparatus corresponds to the method embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The conference recording apparatus provided by the application includes:
the knowledge base construction unit is used for constructing language knowledge bases of all fields;
the conference domain determining unit is used for determining the domain to which the target conference belongs according to the voice data of the target conference sent by the terminal equipment;
and the voice transcription unit is used for determining a text sequence corresponding to the voice data through an end-to-end voice recognition model and a language knowledge base in the field of the target conference to form a conference record of the target conference.
Thirty-first embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for implementing the conference recording method, and the device performs the following steps after being powered on and running the program of the method through the processor: constructing a language knowledge base for each domain; determining, for voice data of a target conference sent by a terminal device, the domain to which the target conference belongs; and determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of that domain to form a conference record of the target conference.
Thirty-second embodiment
In the foregoing embodiment, a voice interaction method is provided; correspondingly, the present application further provides a conference recording method, where the execution subject of the method may be a terminal device or the like. The method corresponds to the system embodiment described above. Parts of this embodiment that are the same as the second embodiment are not described again; please refer to the corresponding parts of the second embodiment.
The conference recording method provided by the application comprises the following steps:
Step 1: collecting voice data of the target conference.
Step 2: sending the voice data to a server, so that the server determines the domain to which the target conference belongs and determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of that domain to form a conference record of the target conference.
As can be seen from the foregoing embodiments, the conference recording method provided in the embodiments of the present application collects voice data of a target conference and sends the voice data to a server, so that the server determines the domain to which the target conference belongs and determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of that domain to form a conference record of the target conference. This processing mode performs conference speech recognition according to the language knowledge base of the conference domain, avoiding the problem in the traditional technology that the sub-language model of each conference domain must be retrained whenever language knowledge is updated, which takes much time and harms the real-time performance of speech recognition; both high real-time performance and high accuracy of speech recognition can therefore be achieved. Meanwhile, since the model relied on by the method is an end-to-end model, the error propagation problem of non-end-to-end models is avoided, which can effectively improve the accuracy of speech recognition.
Thirty-third embodiment
In the foregoing embodiment, a conference recording method is provided; correspondingly, the present application further provides a conference recording apparatus. The apparatus corresponds to the method embodiment described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The conference recording apparatus provided by the application includes:
the voice data acquisition unit is used for acquiring voice data of the target conference;
the voice data sending unit is used for sending the voice data to the server, so that the server can determine the field of the target conference and determine a text sequence corresponding to the voice data through an end-to-end speech recognition model and a language knowledge base in the field of the target conference to form a conference record of the target conference.
Thirty-fourth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for implementing the conference recording method, and the device performs the following steps after being powered on and running the program of the method through the processor: collecting voice data of a target conference; and sending the voice data to a server, so that the server determines the domain to which the target conference belongs and determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of that domain to form a conference record of the target conference.
Although the present application has been described with reference to preferred embodiments, they are not intended to limit the present application. Those skilled in the art can make possible variations and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application should be determined by the claims that follow.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (59)

1. A voice interaction system, comprising:
the intelligent sound box is used for acquiring voice data of a target user and sending the voice data to the server;
the server is used for constructing, for each user, a personalized language knowledge base including language knowledge of at least one sound box service field; determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of the target user; and executing voice interaction processing according to the text sequence.
2. A method of voice interaction, comprising:
constructing, for each user, a personalized language knowledge base including language knowledge of at least one sound box service field;
for voice data of a target user sent by an intelligent sound box, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of the target user;
and executing voice interaction processing according to the text sequence.
3. The method of claim 2, wherein determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of the target user comprises:
determining a first text feature corresponding to the voice data through a language model included by the voice recognition model;
determining a second text feature corresponding to the voice data according to the language knowledge base and the first text feature;
determining the text sequence based at least on the second text feature.
4. The method of claim 3,
determining a second text feature corresponding to the voice data according to the language knowledge base and the first text feature includes:
determining, through an indicator scoring model included in the speech recognition model, a second text feature corresponding to the voice data according to the language knowledge base and the first text feature.
5. The method of claim 4,
determining, through the indicator scoring model, the relevance between the word corresponding to the first text feature and each piece of language knowledge;
and determining the second text feature based at least on the language knowledge associated with words whose relevance is greater than a relevance threshold.
6. The method of claim 3,
the method further comprises the following steps:
encoding the language knowledge through a language knowledge encoder included in the speech recognition model;
storing the encoded data of the language knowledge in a language knowledge storage module included in the speech recognition model;
determining a second text feature corresponding to the voice data according to the language knowledge base and the first text feature includes:
and determining the second text feature according to the encoded data stored by the language knowledge storage module and the first text feature.
7. The method of claim 2, further comprising:
learning from a training data set to obtain the speech recognition model; the training data includes: voice data, a personalized language knowledge base and text sequence labeling information.
8. The method of claim 7, wherein the personalized linguistic knowledge base in the training data is determined by:
and constructing the personalized language knowledge base according to the text sequence marking information of the plurality of training data.
9. The method of claim 2,
the personalized language knowledge base comprises: long-tail entity words, entity words of a reverse language model, entity words of homophones and characters, and entity words in context.
10. The method of claim 2,
the sound box service field includes: the telephone service field;
the language knowledge of the telephone service field includes: names in the user's address book;
the constructing of the personalized language knowledge base of each user including language knowledge of at least one sound box service field comprises the following steps:
receiving user address book information sent by the intelligent sound box corresponding to the user;
and taking the names in the user's address book as the personalized language knowledge of the user.
11. The method of claim 2,
the sound box service field includes: the question-answering service field;
the language knowledge of the question-answering service field includes: a text segment in context;
the constructing of the personalized language knowledge base of each user including language knowledge of at least one sound box service field comprises the following steps:
determining a context text sequence;
and taking a text segment in the context text sequence as the personalized language knowledge of the user.
12. The method of claim 2,
the sound box service field includes: the multimedia playing service field;
the language knowledge of the multimedia playing service field includes: song names;
the constructing of the personalized language knowledge base of each user including language knowledge of at least one sound box service field comprises the following steps:
determining the names of programs historically played by the user;
and taking the historically played program names as the personalized language knowledge of the user.
13. The method of claim 2, wherein the constructing of the personalized language knowledge base of each user including language knowledge of at least one sound box service field is performed in at least one of the following ways:
determining personalized language knowledge of the user according to the shopping data of the user;
and determining the personalized language knowledge of the user according to the text information input by the user.
14. The method of claim 2, further comprising:
and updating the language knowledge base of the user according to the interactive voice data.
15. A method of voice interaction, comprising:
collecting voice data of a target user;
sending the voice data to a server side, so that the server side generates, for each user, a personalized language knowledge base including language knowledge of at least one sound box service field, determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of the target user, and executes voice interaction processing according to the text sequence.
16. A method of voice interaction, comprising:
determining a personalized language knowledge base of a user, wherein the personalized language knowledge base comprises language knowledge of at least one sound box service field;
for the collected voice data of the user, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base;
and executing voice interaction processing according to the text sequence.
17. A method for playing a television program, comprising:
determining a personalized program playing language knowledge base of a user;
for the collected program playing voice instruction data of the user, determining a target program name corresponding to the voice instruction data through an end-to-end speech recognition model and the language knowledge base;
and playing the target program object according to the target program name.
18. The method of claim 17, further comprising:
and taking the program name, the actor name and/or the director name of the historical playing program object of the user as the personalized program playing language knowledge of the user.
19. The method of claim 17, wherein playing the target program object according to the target program name comprises:
determining a television channel and playing time corresponding to the target program name according to a program list;
determining a target program object according to the playing time and the television channel;
and playing the target program object.
20. The method of claim 19,
determining a target program object according to the playing time and the television channel comprises the following steps:
displaying a plurality of program objects played by at least one television channel corresponding to the target program name at a plurality of times;
and taking the program object specified by the user as a target program object.
21. The method of claim 19, further comprising:
if the program list does not comprise the target program name, determining the program name related to the target program name;
displaying the related program name;
and if the user specifies to play the related program object, playing the related program object.
22. A method for playing a television program, comprising:
determining a target program name corresponding to user program playing voice instruction data acquired by the intelligent television;
determining a target program object corresponding to the target program name according to a program list;
and playing the target program object.
23. The method of claim 22,
the determining the target program object corresponding to the target program name according to the program list comprises:
determining a historical target program object corresponding to the target program name according to a historical program list;
and determining a current target program object corresponding to the target program name according to the current program list.
24. A method for playing a television program, comprising:
for the program playing voice instruction data of the user collected by the smart television, the server side determines a target program name corresponding to the voice instruction data;
determining a target program object corresponding to the target program name according to a program list;
and playing the target program object through the intelligent television.
25. A method for playing a television program, comprising:
the intelligent television collects program playing voice instruction data of a user;
sending the voice instruction data to a server, so that the server determines a target program name corresponding to the voice instruction data and determines a target program object corresponding to the target program name according to a program list;
and playing the target program object.
26. A conference recording method, comprising:
constructing a language knowledge base of the conference field;
collecting conference voice data;
and determining a text sequence corresponding to the conference voice data through an end-to-end voice recognition model and a language knowledge base of the conference field to form a conference record.
27. The method of claim 26, further comprising:
and determining a conference field corresponding to the conference voice data.
28. A conference recording method, comprising:
constructing a language knowledge base of each field;
determining, for voice data of a target conference sent by a terminal device, the field to which the target conference belongs;
and determining a text sequence corresponding to the voice data through an end-to-end voice recognition model and a language knowledge base in the field of the target conference to form a conference record of the target conference.
29. A conference recording method, comprising:
collecting voice data of a target conference;
sending the voice data to a server, so that the server determines the field of the target conference and determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and a language knowledge base in the field of the target conference to form a conference record of the target conference.
30. A method for constructing a speech recognition model, comprising:
determining a training data set; the training data includes: voice data, a personalized language knowledge base and text sequence labeling information;
constructing a network structure of an end-to-end voice recognition model;
and learning the speech recognition model from the training data set.
31. The method of claim 30,
the model comprises: a speech encoder, a decoder, a language knowledge encoder, a language model, a feature fusion module and a classifier.
32. The method of claim 30,
the model comprises: a speech encoder, a language model, a language knowledge encoder, an indicator scoring model, a feature fusion module and a classifier.
33. A speech recognition method, comprising:
constructing a personalized language knowledge base of the user, wherein the personalized language knowledge base comprises language knowledge of at least one domain;
and for user voice data collected by a terminal device, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base.
34. A voice interaction apparatus, comprising:
the knowledge base construction unit is used for constructing, for each user, a personalized language knowledge base including language knowledge of at least one sound box service field;
the voice recognition unit is used for determining a text sequence corresponding to voice data of a target user sent by the intelligent sound box through an end-to-end voice recognition model and the language knowledge base of the target user;
and the instruction processing unit is used for executing voice interaction processing according to the text sequence.
35. An electronic device, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the voice interaction method, and after the equipment is powered on and the program for realizing the voice interaction method is run by the processor, the following steps are executed: constructing, for each user, a personalized language knowledge base including language knowledge of at least one sound box service field; for voice data of a target user sent by an intelligent sound box, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of the target user; and executing voice interaction processing according to the text sequence.
36. A voice interaction apparatus, comprising:
the voice data acquisition unit is used for acquiring voice data of a target user;
the voice data sending unit is used for sending the voice data to the server, so that the server generates, for each user, a personalized language knowledge base including language knowledge of at least one sound box service field, determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of the target user, and executes voice interaction processing according to the text sequence.
37. An intelligent sound box, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the voice interaction method, and after the equipment is powered on and the program for realizing the voice interaction method is run by the processor, the following steps are executed: collecting voice data of a target user; and sending the voice data to a server side, so that the server side generates, for each user, a personalized language knowledge base including language knowledge of at least one sound box service field, determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of the target user, and executes voice interaction processing according to the text sequence.
38. An electronic device, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the voice interaction method, and after the equipment is powered on and the program for realizing the voice interaction method is run by the processor, the following steps are executed: collecting voice data of a target user; and sending the voice data to a server side, so that the server side generates, for each user, a personalized language knowledge base including language knowledge of at least one domain, determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base of the target user, and executes voice interaction processing according to the text sequence.
39. A speech recognition model construction apparatus, comprising:
a data preparation unit for determining a training data set; the training data includes: voice data, a personalized language knowledge base and text sequence labeling information;
the network construction unit is used for constructing a network structure of an end-to-end voice recognition model;
and the network training unit is used for learning the voice recognition model from the training data set.
40. An electronic device, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the speech recognition model building method, and after the equipment is powered on and runs the program of the method through the processor, the following steps are executed: determining a training data set; the training data includes: voice data, a personalized language knowledge base and text sequence labeling information; constructing a network structure of an end-to-end voice recognition model; and learning the speech recognition model from the training data set.
41. A voice interaction apparatus, comprising:
a knowledge base construction unit for determining a personalized language knowledge base of a user including language knowledge of at least one domain;
the voice recognition unit is used for determining, for the collected voice data of the user, a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base;
and the instruction processing unit is used for executing voice interaction processing according to the text sequence.
42. An intelligent sound box, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the voice interaction method, and after the equipment is powered on and the program for realizing the voice interaction method is run by the processor, the following steps are executed: determining a personalized language knowledge base of a user, wherein the personalized language knowledge base comprises language knowledge of at least one sound box service field; for the collected voice data of the user, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base; and executing voice interaction processing according to the text sequence.
43. An electronic device, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the voice interaction method, and after the equipment is powered on and the program for realizing the voice interaction method is run by the processor, the following steps are executed: determining a personalized language knowledge base of the user including language knowledge of at least one domain; for the collected voice data of the user, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base; and executing voice interaction processing according to the text sequence.
44. A television program playback apparatus, comprising:
the knowledge base determining unit is used for determining, at the smart television, the personalized program playing language knowledge base of the user;
the program identification unit is used for determining, for the collected program playing voice instruction data of the user, a target program name corresponding to the voice instruction data through an end-to-end speech recognition model and the language knowledge base;
and the program playing unit is used for playing the target program object according to the target program name.
45. An intelligent television, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the television program playing method, and after the equipment is powered on and the program of the method is run by the processor, the following steps are executed: determining a personalized program playing language knowledge base of a user; for the collected program playing voice instruction data of the user, determining a target program name corresponding to the voice instruction data through an end-to-end speech recognition model and the language knowledge base; and playing the target program object according to the target program name.
46. A television program playback apparatus, comprising:
the program name recognition unit is used for determining a target program name corresponding to the program playing voice instruction data of the user collected by the smart television;
the program object determining unit is used for determining a target program object corresponding to the target program name according to a program list;
and the playing unit is used for playing the target program object.
47. An intelligent television, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the television program playing method, and after the equipment is powered on and the program of the method is run by the processor, the following steps are executed: determining a target program name corresponding to user program playing voice instruction data acquired by the intelligent television; determining a target program object corresponding to the target program name according to a program list; and playing the target program object.
48. A television program playback apparatus, comprising:
the program name recognition unit is used for determining, for the program playing voice instruction data of the user collected by the smart television, a target program name corresponding to the voice instruction data;
the program object determining unit is used for determining a target program object corresponding to the target program name according to a program list;
and the playing unit is used for playing the target program object through the intelligent television.
49. An electronic device, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the television program playing method, and after the equipment is powered on and the program of the method is run by the processor, the following steps are executed: for the program playing voice instruction data of the user collected by the smart television, determining a target program name corresponding to the voice instruction data; determining a target program object corresponding to the target program name according to a program list; and playing the target program object through the smart television.
50. A television program playback apparatus, comprising:
the voice instruction acquisition unit is used for collecting, at the smart television, program playing voice instruction data of a user;
the voice instruction sending unit is used for sending the voice instruction data to the server, so that the server can determine a target program name corresponding to the voice instruction data and determine a target program object corresponding to the target program name according to a program list;
and the playing unit is used for playing the target program object.
51. An intelligent television, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the television program playing method, and after the equipment is powered on and the program of the method is run by the processor, the following steps are executed: collecting program playing voice instruction data of a user; sending the voice instruction data to a server, so that the server determines a target program name corresponding to the voice instruction data and determines a target program object corresponding to the target program name according to a program list; and playing the target program object.
52. A conference recording apparatus, comprising:
the knowledge base construction unit is used for constructing a language knowledge base of the conference field;
the voice data acquisition unit is used for acquiring conference voice data;
and the voice transcription unit is used for determining a text sequence corresponding to the conference voice data through an end-to-end voice recognition model and a language knowledge base in the conference field to form a conference record.
53. An electronic device, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the conference recording method, and after the equipment is powered on and the program for realizing the conference recording method is run by the processor, the following steps are executed: constructing a language knowledge base of the conference field; collecting conference voice data; and determining a text sequence corresponding to the conference voice data through an end-to-end voice recognition model and a language knowledge base of the conference field to form a conference record.
54. A conference recording apparatus, comprising:
the knowledge base construction unit is used for constructing language knowledge bases of all fields;
the conference domain determining unit is used for determining the domain to which the target conference belongs according to the voice data of the target conference sent by the terminal equipment;
and the voice transcription unit is used for determining a text sequence corresponding to the voice data through an end-to-end voice recognition model and a language knowledge base in the field of the target conference to form a conference record of the target conference.
55. An electronic device, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the conference recording method, and after the equipment is powered on and the program for realizing the conference recording method is run by the processor, the following steps are executed: constructing a language knowledge base of each field; determining, for voice data of a target conference sent by a terminal device, the field to which the target conference belongs; and determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and a language knowledge base in the field of the target conference to form a conference record of the target conference.
56. A conference recording apparatus, comprising:
the voice data acquisition unit is used for acquiring voice data of the target conference;
the voice data sending unit is used for sending the voice data to the server, so that the server can determine the field of the target conference and determine a text sequence corresponding to the voice data through an end-to-end speech recognition model and a language knowledge base in the field of the target conference to form a conference record of the target conference.
57. An electronic device, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the conference recording method, and after the equipment is powered on and the program for realizing the conference recording method is run by the processor, the following steps are executed: collecting voice data of a target conference; and sending the voice data to a server, so that the server determines the field of the target conference and determines a text sequence corresponding to the voice data through an end-to-end speech recognition model and a language knowledge base in the field of the target conference to form a conference record of the target conference.
58. A speech recognition apparatus, comprising:
the knowledge base construction unit is used for constructing, for a user, a personalized language knowledge base including language knowledge of at least one domain;
and the model prediction unit is used for determining, for the user voice data collected by the terminal device, a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base.
59. An electronic device, comprising:
a processor and a memory;
the memory is used for storing a program for realizing the voice recognition method, and after the equipment is powered on and the program for realizing the voice recognition method is run by the processor, the following steps are executed: constructing a personalized language knowledge base of the user, wherein the personalized language knowledge base comprises language knowledge of at least one domain; and for user voice data collected by a terminal device, determining a text sequence corresponding to the voice data through an end-to-end speech recognition model and the language knowledge base.
CN202010552193.7A 2020-06-16 2020-06-16 Voice interaction system, related method, device and equipment Pending CN113808593A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010552193.7A CN113808593A (en) 2020-06-16 2020-06-16 Voice interaction system, related method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010552193.7A CN113808593A (en) 2020-06-16 2020-06-16 Voice interaction system, related method, device and equipment

Publications (1)

Publication Number Publication Date
CN113808593A true CN113808593A (en) 2021-12-17

Family

ID=78943401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010552193.7A Pending CN113808593A (en) 2020-06-16 2020-06-16 Voice interaction system, related method, device and equipment

Country Status (1)

Country Link
CN (1) CN113808593A (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722525A (en) * 2012-05-15 2012-10-10 北京百度网讯科技有限公司 Methods and systems for establishing language model of address book names and searching voice
CN102833610A (en) * 2012-09-24 2012-12-19 北京多看科技有限公司 Program selection method, apparatus and digital television terminal
CN103945351A (en) * 2013-01-17 2014-07-23 上海博路信息技术有限公司 Automatic switching system based on voice identification
US20150279365A1 (en) * 2014-04-01 2015-10-01 Google Inc. Identification of communication-related voice commands
CN105392035A (en) * 2014-09-03 2016-03-09 深圳市同方多媒体科技有限公司 System and method for switching programs of intelligent television
CN107832434A (en) * 2017-11-15 2018-03-23 百度在线网络技术(北京)有限公司 Method and apparatus based on interactive voice generation multimedia play list
CN108268450A (en) * 2018-02-27 2018-07-10 百度在线网络技术(北京)有限公司 For generating the method and apparatus of information
CN109325097A (en) * 2018-07-13 2019-02-12 海信集团有限公司 A kind of voice guide method and device, electronic equipment, storage medium
CN108986790A (en) * 2018-09-29 2018-12-11 百度在线网络技术(北京)有限公司 The method and apparatus of voice recognition of contact
CN109243468A (en) * 2018-11-14 2019-01-18 北京羽扇智信息科技有限公司 Audio recognition method, device, electronic equipment and storage medium
CN110265013A (en) * 2019-06-20 2019-09-20 平安科技(深圳)有限公司 The recognition methods of voice and device, computer equipment, storage medium
CN110297887A (en) * 2019-06-26 2019-10-01 山东大学 Service robot personalization conversational system and method based on cloud platform
CN110866099A (en) * 2019-10-30 2020-03-06 南昌众荟智盈信息技术有限公司 Intelligent steward service method and system based on intelligent sound box voice interaction
CN111145756A (en) * 2019-12-26 2020-05-12 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115206299A (en) * 2022-09-15 2022-10-18 成都启英泰伦科技有限公司 Confusing word anti-error identification method based on command word sound identification
CN115206299B (en) * 2022-09-15 2022-11-11 成都启英泰伦科技有限公司 Confusing word anti-error identification method based on command word sound identification

Similar Documents

Publication Publication Date Title
CN109616108B (en) Multi-turn dialogue interaction processing method and device, electronic equipment and storage medium
CN107146612A (en) Voice guide method, device, smart machine and server
CN109036374B (en) Data processing method and device
CN106796496A (en) Display device and its operating method
US7160112B2 (en) System and method for language education using meaning unit and relational question
CN109036372B (en) Voice broadcasting method, device and system
CN109979450B (en) Information processing method and device and electronic equipment
CN110430465B (en) Learning method based on intelligent voice recognition, terminal and storage medium
CN108882101B (en) Playing control method, device, equipment and storage medium of intelligent sound box
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN108899036A (en) A kind of processing method and processing device of voice data
KR101427528B1 (en) Method of interactive language learning using foreign Video contents and Apparatus for it
CN115952272A (en) Method, device and equipment for generating dialogue information and readable storage medium
CN113808593A (en) Voice interaction system, related method, device and equipment
US11475894B2 (en) Method and apparatus for providing feedback information based on audio input
CN112837674B (en) Voice recognition method, device, related system and equipment
CN110570838B (en) Voice stream processing method and device
US11706495B2 (en) Apparatus and system for providing content based on user utterance
CN113488034A (en) Voice information processing method, device, equipment and medium
CN114694629B (en) Voice data amplification method and system for voice synthesis
US20220208190A1 (en) Information providing method, apparatus, and storage medium, that transmit related information to a remote terminal based on identification information received from the remote terminal
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
CN113889117A (en) Voice cross-correlation system, method, device and equipment
CN113223513A (en) Voice conversion method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination