WO2022141142A1 - Method and system for determining target audio and video - Google Patents

Method and system for determining target audio and video

Info

Publication number
WO2022141142A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
video
target
audio
information
Prior art date
Application number
PCT/CN2020/141192
Other languages
English (en)
French (fr)
Inventor
李少红
李勇
石世壮
林俊江
覃金诚
Original Assignee
浙江核新同花顺网络信息股份有限公司
Priority date
Filing date
Publication date
Application filed by 浙江核新同花顺网络信息股份有限公司
Priority to PCT/CN2020/141192
Publication of WO2022141142A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/42 Systems providing special services or facilities to subscribers
    • H04M 3/50 Centralised arrangements for answering calls; centralised arrangements for recording messages for absent or busy subscribers
    • H04M 3/527 Centralised call answering arrangements not requiring operator intervention

Definitions

  • the present application relates to the field of audio and video processing, and in particular, to a method and system for determining target audio and video.
  • human-machine dialogue technology has been widely used in daily life, for example in intelligent customer service robots, smart speakers, chat robots, and smart homes.
  • User needs can be met through human-machine voice dialogue.
  • users can have their questions answered by intelligent customer service robots.
  • the dialogue between the intelligent customer service robot and the user is typically a plain-text dialogue, or an instant-messaging exchange combining voice, text, and pictures.
  • such feedback messages demand a high level of comprehension and learning ability from the user, and the user often cannot solve the problem directly based on the feedback message alone. Therefore, it is desirable to provide a method and system for intelligently generating audio and video, which can answer a user's question or conduct a dialogue with the user more intuitively through audio and video.
  • One aspect of the present application provides a method for determining target audio and video.
  • the method may be implemented on a computing device, which may have at least one processor and at least one storage device.
  • the method may include the following steps: obtaining dialogue information related to the user; determining dialogue feature information of the dialogue information; obtaining user feature information of the user; determining, based on the dialogue feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and determining the target audio and video based on the at least one target attribute.
  • determining the target audio and video based on the at least one target attribute may include: acquiring a database, the database including a plurality of materials, the plurality of materials including at least one of the following: one or more candidate audios, one or more candidate videos, one or more candidate texts, or one or more candidate images; and determining the target audio and video based on the at least one target attribute and the database.
  • determining the target audio and video based on the at least one target attribute may further include: for each of the at least one target attribute, calculating a matching degree between each of the plurality of materials in the database and the target attribute; for each of the plurality of materials, determining a total matching score based on the at least one matching degree of the material corresponding to the at least one target attribute; selecting, based on the multiple matching scores corresponding to the multiple materials in the database, one or more target materials from the multiple materials; and determining the target audio and video based on the one or more target materials.
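  • As an illustration only (not part of the claimed method), the following minimal Python sketch shows one way such a matching procedure could be realized. The Material type, the exact-match scoring function, and the equal-weight combination of matching degrees are assumptions made for the example, not details prescribed by the application:

```python
from dataclasses import dataclass

@dataclass
class Material:
    """A candidate audio, video, text, or image with its basic attributes."""
    name: str
    attributes: dict

def match_degree(material: Material, attr_name: str, attr_value) -> float:
    """Matching degree between one material and one target attribute (0..1).
    Exact match is a stand-in; a real system might use a learned similarity."""
    return 1.0 if material.attributes.get(attr_name) == attr_value else 0.0

def select_target_materials(materials, target_attributes: dict, top_k: int = 1):
    """Combine per-attribute matching degrees into a total matching score
    for each material, then select the top-k materials by that score."""
    scored = []
    for m in materials:
        degrees = [match_degree(m, k, v) for k, v in target_attributes.items()]
        scored.append((sum(degrees) / len(degrees), m))  # equal-weight mean
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]

materials = [
    Material("usage_video_brief", {"detail_level": "brief", "difficulty": "simple"}),
    Material("usage_video_full", {"detail_level": "detailed", "difficulty": "simple"}),
]
targets = {"detail_level": "detailed", "difficulty": "simple"}
print([m.name for m in select_target_materials(materials, targets)])
# -> ['usage_video_full']
```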
  • determining the target audio and video based on the at least one target attribute may further include: generating the target audio and video by adjusting basic attributes of each target material based on the one or more target materials and the at least one target attribute.
  • the at least one target attribute of the target audio and video may include one or more of content attribute, level of detail, difficulty of understanding, playback speed, picture tone or timbre.
  • determining at least one target attribute corresponding to the target audio and video may include: acquiring at least one trained target attribute determination model; inputting at least a portion of the dialogue feature information and the user feature information into the at least one trained target attribute determination model; and determining the at least one target attribute based on an output of the at least one trained target attribute determination model.
  • determining the target audio and video based on the at least one target attribute may include: acquiring a trained material determination model; inputting the dialogue feature information into the material determination model; determining initial audio and video based on an output of the material determination model; and generating the target audio and video by adjusting basic attributes of the initial audio and video based on the at least one target attribute.
  • the target audio and video may include one or more segments.
  • the method may further include: determining whether the user provides user feedback during the playback of the target audio and video; and in response to the user providing the user feedback during the playback of the target audio and video, determining, based on the user feedback, whether the basic attributes of at least one unplayed segment in the one or more segments of the target audio and video need to be adjusted.
  • the user feedback provided by the user may include one or more of the following: the number of pauses, the pause duration, the number of playbacks, the playback duration, the number of fast-forwards, the fast-forward duration, the number of slow playbacks, the slow playback duration, whether a new question is asked, and whether the playback is ended early.
  • the user characteristic information may include user personal information, and the user personal information includes one or more of the following: age, gender, education, work background, and health status.
  • the user feature information may include the user's preference information.
  • the user's preference information may include at least one of the user's preference settings, the user's current mood, or historical user feedback information provided by the user for historical audio and video in the past.
  • the system may include: at least one memory storing computer instructions; and at least one processor in communication with the memory, wherein when the at least one processor executes the computer instructions, the at least one processor causes the system to perform: obtaining dialogue information related to the user; determining dialogue feature information of the dialogue information; obtaining user feature information of the user; determining, based on the dialogue feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and determining the target audio and video based on the at least one target attribute.
  • the at least one processor may cause the system to further perform: acquiring a database, the database including a plurality of materials, the plurality of materials including at least one of the following: one or more candidate audios, one or more candidate videos, one or more candidate texts, or one or more candidate images; and determining the target audio and video based on the at least one target attribute and the database.
  • the at least one processor may cause the system to further perform: for each of the at least one target attribute, calculating a matching degree between each of the plurality of materials in the database and the target attribute; for each of the plurality of materials, determining a total matching score based on the at least one matching degree of the material corresponding to the at least one target attribute; selecting, based on the multiple matching scores corresponding to the multiple materials in the database, one or more target materials from the multiple materials; and determining the target audio and video based on the one or more target materials.
  • the at least one processor may cause the system to further perform: generating the target audio and video by adjusting basic attributes of the one or more target materials based on the one or more target materials and the at least one target attribute.
  • the at least one target attribute of the target audio and video may include one or more of content attribute, level of detail, difficulty of understanding, playback speed, picture tone or timbre.
  • the at least one processor may cause the system to further perform: acquiring at least one trained target attribute determination model; inputting at least a portion of the dialogue feature information and the user feature information into the at least one trained target attribute determination model; and determining the at least one target attribute based on an output of the at least one trained target attribute determination model.
  • the at least one processor may cause the system to further perform: acquiring a trained material determination model; inputting the dialogue feature information into the material determination model; determining initial audio and video based on an output of the material determination model; and generating the target audio and video by adjusting basic attributes of the initial audio and video based on the at least one target attribute.
  • the target audio and video may include one or more segments.
  • the at least one processor may cause the system to further perform: determining whether the user provides user feedback during the playback of the target audio and video; and in response to the user providing user feedback during the playback of the target audio and video, determining, based on the user feedback, whether the basic attributes of at least one unplayed segment in the one or more segments of the target audio and video need to be adjusted.
  • the user feedback provided by the user may include one or more of the following: the number of pauses, the pause duration, the number of playbacks, the playback duration, the number of fast-forwards, the fast-forward duration, the number of slow playbacks, the slow playback duration, whether a new question is asked, and whether the playback is ended early.
  • the user characteristic information may include user personal information, and the user personal information may include one or more of the following: age, gender, education, work background, and health status.
  • the user characteristic information may include the user's preference information.
  • the user's preference information may include at least one of the user's preference settings, the user's current mood, or historical user feedback information provided by the user for historical audio and video in the past.
  • the apparatus may include: an acquisition module for acquiring dialogue information related to a user and for acquiring user feature information of the user; and a determination module for determining dialogue feature information of the dialogue information, determining at least one target attribute corresponding to the target audio and video based on the dialogue feature information and the user feature information, and determining the target audio and video based on the at least one target attribute.
  • the storage medium stores computer instructions, and after the computer reads the computer instructions in the storage medium, the computer executes the method.
  • the method may include: acquiring dialogue information related to a user; determining dialogue feature information of the dialogue information; acquiring user feature information of the user; determining, based on the dialogue feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and determining the target audio and video based on the at least one target attribute.
  • FIG. 1 is a scene diagram of a system for determining target audio and video according to some embodiments of the present application;
  • FIG. 2 is a schematic diagram of exemplary hardware and/or software components of an exemplary computing device shown in accordance with some embodiments of the present application;
  • FIG. 3 is a schematic diagram of an exemplary terminal device according to some embodiments of the present application.
  • FIG. 4 is a block diagram of an exemplary processing device shown in accordance with some embodiments of the present application.
  • FIG. 5 is an exemplary flowchart of determining target audio and video according to some embodiments of the present application.
  • FIG. 6 is a flowchart of determining at least one target attribute corresponding to target audio and video according to some embodiments of the present application;
  • FIG. 7 is a flowchart of determining target audio and video based on a database according to some embodiments of the present application.
  • FIG. 8 is another flowchart of determining target audio and video according to some embodiments of the present application.
  • FIG. 9 is a schematic diagram of interaction between a terminal and a server according to some embodiments of the present application.
  • the words "system", "device", "unit", and/or "module" used herein are one way to distinguish different components, elements, parts, or assemblies at different levels.
  • the present application discloses a system and method for determining target audio and video.
  • the method may include acquiring dialogue information related to a user; determining dialogue feature information of the dialogue information; acquiring user feature information of the user; determining, based on the dialogue feature information and the user feature information, at least one target attribute; and determining the target audio and video based on the at least one target attribute.
  • audio and video is also referred to as A/V, meaning audio and/or video.
  • target audio and video can be provided according to the user's needs, thereby giving the user a better experience.
  • the method may further include determining, based on user feedback, whether it is necessary to adjust basic properties of at least one unplayed segment in the one or more segments of the target audio and video. According to the user's feedback in the process of watching the target audio and video, the basic attributes (eg, playback speed, etc.) of the target audio and video can be adjusted to bring a better viewing experience to the user.
  • FIG. 1 is a schematic diagram of an exemplary system for determining target audio and video according to some embodiments of the present application.
  • the system 100 for determining target audio and video (or simply the system 100) may be a system for human-machine dialogue.
  • the system 100 can be applied to various intelligent human-machine dialogue devices, including but not limited to intelligent customer service robots, smart speakers, chat robots, smart home devices (eg, smart TVs, smart air conditioners, smart sweeping/mopping devices), smart transportation, etc.
  • the system 100 can also provide interactive services for the user in combination with the webpage or APP on the terminal.
  • the server 110 in the system 100 can answer questions for the user through the intelligent customer service system on the APP. This application does not limit this.
  • system 100 may include server 110 , network 120 , terminal 130 and storage device 140 .
  • server 110 may be a single server or a group of servers.
  • the server group may be centralized or distributed (eg, server 110 may be a distributed system).
  • server 110 may be local or remote.
  • the server 110 may access information and/or data stored in the terminal 130 and/or the storage device 140 via the network 120 .
  • the server 110 may be directly connected to the terminal 130 and/or the storage device 140 to access stored information and/or data.
  • server 110 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distribution cloud, an internal cloud, a multi-layer cloud, etc., or any combination thereof.
  • server 110 may be implemented on computing device 200 that includes one or more of the components shown in FIG. 2 .
  • server 110 may include processing device 112 .
  • Processing device 112 may process information and/or data related to determining target audio and video to perform one or more functions described herein. For example, the processing device 112 may determine at least one target attribute corresponding to the target audio and video based on the dialogue feature information and the user feature information. The processing device 112 may also determine the target audio and video based on the at least one target attribute.
  • the processing device 112 may include one or more processing engines (eg, a single-core processing engine or a multi-core processing engine).
  • the processing device 112 may include a central processing unit (CPU), an application specific integrated circuit (ASIC), an application specific instruction set processor (ASIP), a graphics processing unit (GPU), a physical processing unit (PPU), a digital signal processor (DSP), Field programmable gate array (FPGA), programmable logic device (PLD), controller, microcontroller unit, reduced instruction set computer (RISC), microprocessor, etc., or any combination thereof.
  • the processing device 112 may be integrated in the terminal 130 .
  • Network 120 may facilitate the exchange of information and/or data.
  • one or more components of the system 100 (e.g., the server 110, the terminal 130, or the storage device 140) may send information and/or data to other components of the system 100 via the network 120.
  • the server 110 may obtain dialog information related to the user from the terminal 130 via the network 120 .
  • the network 120 may be a wired network or a wireless network, or the like, or any combination thereof.
  • the network 120 may include a cable network, a wired network, a fiber-optic network, a telecommunications network, an internal network, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, etc., or any combination thereof.
  • network 120 may include one or more network access points.
  • network 120 may include wired or wireless network access points, such as base stations and/or internet exchange points 120-1, 120-2, .... Through the access points, one or more components of the system 100 for determining target audio and video may be connected to the network 120 to exchange data and/or information.
  • a user may be an individual using terminal 130 .
  • the user can use the terminal 130 to conduct conversations, watch target audio and video, and the like.
  • terminal 130 may include mobile device 130-1, tablet computer 130-2, laptop computer 130-3, in-vehicle device 130-4, etc., or any combination thereof.
  • mobile device 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, an intelligent customer service robot, a chat robot, a smart vehicle, etc., or any combination thereof.
  • smart home devices may include smart lighting devices, smart appliance control devices, smart monitoring devices, smart TVs, smart cameras, walkie-talkies, smart speakers, smart sweeping/mopping devices, etc., or any combination thereof.
  • the wearable device may include smart bracelets, smart footwear, smart glasses, smart helmets, smart watches, smart clothing, smart backpacks, smart accessories, etc., or any combination thereof.
  • an intelligent mobile device may include a smartphone, personal digital assistant (PDA), gaming device, navigation device, point of sale (POS), etc., or any combination thereof.
  • the virtual reality device and/or augmented virtual reality device may include a virtual reality headset, virtual reality glasses, virtual reality goggles, augmented reality helmet, augmented reality glasses, augmented reality goggles, etc., or any combination thereof.
  • virtual reality devices and/or augmented reality devices may include Google Glass™, Oculus Rift™, HoloLens™, or Gear VR™, among others.
  • in-vehicle equipment 130-4 may include an in-vehicle computer, an in-vehicle television, and the like.
  • the storage device 140 may store data and/or instructions related to determining target audio and video. In some embodiments, storage device 140 may store data obtained from terminal 130 . In some embodiments, storage device 140 may store data and/or instructions that server 110 may execute or use to perform the exemplary methods described in this disclosure. In some embodiments, storage device 140 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), the like, or any combination thereof. Exemplary mass storage may include magnetic disks, optical disks, solid state disks, and the like. Exemplary removable storage may include flash drives, floppy disks, optical disks, memory cards, compact disks, magnetic tapes, and the like. Exemplary volatile read-write memory may include random access memory (RAM).
  • Exemplary RAMs may include dynamic random access memory (DRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), static random access memory (SRAM), thyristor random access memory (T-RAM), zero-capacitor random access memory (Z-RAM), etc.
  • Exemplary ROMs may include mask read-only memory (MROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM), digital versatile disc read-only memory, etc.
  • the storage device 140 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distribution cloud, an internal cloud, a multi-layer cloud, etc., or any combination thereof.
  • storage device 140 may be connected to network 120 to communicate with one or more components of the system 100 (e.g., server 110, terminal 130). One or more components of the system 100 may access data and/or instructions stored in the storage device 140 via the network 120. In some embodiments, storage device 140 may be directly connected to or in communication with one or more components of system 100 (e.g., server 110, terminal 130). In some embodiments, storage device 140 may be part of server 110.
  • the elements of the system 100 may operate through electrical and/or electromagnetic signals.
  • the processor of the terminal 130 may generate an electrical signal encoding the request.
  • the processor of the terminal 130 can then send the electrical signal to the output port.
  • the output port may be physically connected to a cable, which further transmits electrical signals to the input port of the server 110 .
  • the output port of the terminal 130 may be one or more antennas that convert electrical signals into electromagnetic signals.
  • within an electronic device, such as the terminal 130 and/or the server 110, when a processor thereof processes an instruction, issues an instruction, and/or performs an action, the instruction and/or action is performed through electrical signals.
  • when a processor retrieves or saves data from a storage medium (e.g., storage device 140), it can send electrical signals to a read/write device of the storage medium, which can read or write structured data in the storage medium.
  • the structured data can be transmitted to the processor in the form of electrical signals through the bus of the electronic device.
  • an electrical signal may refer to one electrical signal, a series of electrical signals and/or at least two discontinuous electrical signals.
  • FIG. 2 is a schematic diagram of example hardware and/or software components of an example computing device shown in accordance with some embodiments of the present application.
  • the server 110 and the terminal 130 may be implemented on the computing device 200.
  • processing device 112 may be implemented on computing device 200 and configured to perform the functions of processing device 112 disclosed in this application.
  • Computing device 200 may be used to implement any component of the system 100 for determining target audio and video as described herein.
  • processing device 112 may be implemented on computing device 200 by hardware, software programs, firmware, or a combination thereof.
  • Although only one such computer is shown for convenience, the computer functions related to the human-machine dialogue techniques described herein may be implemented in a distributed fashion across multiple similar platforms to distribute the processing load.
  • Computing device 200 may include a communication port 250 connected to a network to facilitate data communication.
  • Computing device 200 may also include a processor 220 that executes program instructions in the form of one or more logic circuits.
  • the processor 220 may include interface circuitry and processing circuitry therein.
  • the interface circuit may be configured to receive electrical signals from the bus 210, where the electrical signals encode structured data and/or instructions for the processing circuit.
  • the processing circuit may perform logical calculations and then determine conclusions, results and/or instruction codes as electrical signals.
  • the interface circuit may then issue electrical signals from the processing circuit via the bus 210 .
  • Computing device 200 may also include various forms of program storage and data storage, including, for example, a disk 270, a read-only memory (ROM) 230, or a random access memory (RAM) 240, for storing various data files to be processed and/or transmitted by computing device 200.
  • Computing device 200 may also include program instructions stored in ROM 230 , RAM 240 , and/or other types of non-transitory storage media executed by processor 220 .
  • the methods and/or processes of the present application may be implemented in the form of program instructions.
  • Computing device 200 also includes input/output components 260 to support input/output between the computer and other components.
  • Computing device 200 may also receive programming and data through network communications.
  • For convenience of explanation, only one processor is depicted in FIG. 2. At least two processors may also be included, so that operations and/or method steps described herein as performed by one processor may also be performed jointly or individually by multiple processors. For example, if operation A and operation B are performed by the processor of computing device 200 in the present application, it should be understood that operation A and operation B may also be performed jointly or separately by two different CPUs and/or processors of computing device 200 (e.g., a first processor performs operation A and a second processor performs operation B, or the first and second processors perform operations A and B together).
  • terminal 130 may be implemented on mobile device 300 .
  • the mobile device 300 may include a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an input/output (I/O) 350, a memory 360, a mobile operating system (OS) 370, and a storage 390.
  • any other suitable components including but not limited to a system bus or controller (not shown), may also be included within mobile device 300 .
  • a mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™) and one or more applications 380 may be loaded from the storage 390 into the memory 360 for execution by the CPU 340.
  • Application 380 may include a browser or any other suitable mobile application for receiving and presenting information related to determining target audio and video, or other information, from the system 100. User interaction with the information stream may be accomplished through I/O 350 and provided through network 120 to processing device 112 and/or other components of the system 100 for determining target audio and video.
  • a computer hardware platform may be used as the hardware platform for one or more of the components described herein.
  • a computer with user interface components can be used to implement a personal computer (PC) or any other type of workstation or terminal device. If the computer is properly programmed, the computer can also function as a server.
  • the processing device may be the exemplary processing device 112 described in FIG. 1 .
  • the processing device 112 may determine the target audio and video based on at least one target attribute.
  • processing device 112 may be implemented on a processing unit (e.g., processor 220 shown in FIG. 2 or CPU 340 shown in FIG. 3).
  • the processing device 112 may be implemented on the CPU 340 of the terminal device.
  • the processing device 112 may include an acquisition module 410 , a determination module 420 , and a training module 430 .
  • the obtaining module 410 can obtain information related to the system 100 .
  • the acquisition module 410 may obtain the dialogue information sent by the user and the user feature information of the user.
  • For the dialogue information sent by the user and the user feature information of the user, reference may be made to relevant descriptions elsewhere in this application (for example, FIG. 5 and its related descriptions), which will not be repeated here.
  • the determining module 420 may determine the dialogue feature information of the dialogue information.
  • the determining module 420 may determine at least one target attribute corresponding to the target audio and video based on the dialogue feature information and the user feature information.
  • the determining module 420 may also determine the target audio and video based on the at least one target attribute.
  • the determining module 420 may also determine whether the user provides user feedback during the playback of the target audio and video.
  • the determining module 420 may also, in response to the user providing user feedback during the playback of the target audio and video, determine, based on the user feedback, whether it is necessary to adjust the basic attributes of at least one unplayed segment in one or more segments of the target audio and video.
  • For descriptions of determining the at least one target attribute, the target audio and video, the user feedback, and whether it is necessary to adjust the basic attributes of at least one unplayed segment in one or more segments of the target audio and video, reference may be made to relevant descriptions elsewhere in this application (for example, FIGS. 5-8 and their relevant descriptions), which will not be repeated here.
  • the training module 430 can be used to train a target attribute determination model and a material determination model.
  • For the description of the training of the target attribute determination model and the material determination model, reference may be made to relevant descriptions elsewhere in this application (e.g., FIG. 5, FIG. 8, and their related descriptions), which will not be repeated here.
  • the system and its modules shown in FIG. 4 can be implemented in various ways.
  • the system and its modules may be implemented in hardware, software, or a combination of software and hardware.
  • the hardware part can be realized by using dedicated logic; the software part can be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware.
  • the methods and systems described above may be implemented using computer-executable instructions and/or embodied in processor control code, provided, for example, on a carrier medium such as a disk, CD, or DVD-ROM, on a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier.
  • the system and its modules of this specification can be implemented not only by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices, but also by software executed by various types of processors, and also by a combination of the above hardware circuits and software (e.g., firmware).
  • The above description of the processing device 400 and its modules is only for convenience of description and does not limit this specification to the scope of the illustrated embodiments. It can be understood that, for those skilled in the art, after understanding the principle of the system, various modules may be combined arbitrarily, or a subsystem may be formed to connect with other modules, without departing from this principle. For example, in some embodiments, the acquisition module and the determination module disclosed in FIG. 4 may be different modules in a system, or one module may implement the functions of two or more of the modules described above.
  • the determination module can be subdivided into a dialogue feature information determination unit, a target attribute determination unit, and a target audio and video determination unit, which are respectively used for determining dialogue feature information, determining at least one target attribute, and determining target audio and video.
  • each module in the processing device 400 may share one storage module, and each module may also have its own storage module. Such deformations are all within the protection scope of this specification.
  • the training module in FIG. 4 may be omitted. The training process for one or more machine learning models can be done by external processing devices.
  • FIG. 5 is an exemplary flowchart of determining target audio and video according to some embodiments of the present application.
  • process 500 may be implemented by a set of instructions (e.g., an application program).
  • the processor 220 and/or the modules in FIG. 4 may execute a set of instructions, and when the instructions are executed, the processor 220 and/or the modules may be configured to perform the process 500 .
  • the operations of the process shown below are for illustration purposes only. In some embodiments, process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed herein. Additionally, the order of the operations of process 500 shown in FIG. 5 and described below is not limiting.
  • the processing device 112 may acquire dialog information related to the user.
  • the dialogue information related to the user refers to the dialogue information sent by the user through the terminal and/or the dialogue information received by the user through the terminal.
  • the dialogue information sent by the user may take the form of, but is not limited to, voice, text, and pictures.
  • a user can interact with a computing device (eg, computing device 200 , server 110 ) by sending dialog information through a user interface on a terminal (eg, terminal 130 shown in FIG. 1 ).
  • users can have conversations with smart customer service bots, chatbots, and more.
  • the processing device 112 may obtain the dialog information sent by the user from the terminal through the network 120 .
  • the processing device 112 may acquire the dialog information sent by the user from the terminal in real time through the network 120 .
  • the processing device 112 may obtain the dialogue information most recently sent by the user and/or contextual dialogue information.
  • the latest dialogue information sent by the user may include one or more characters or words, a sentence, a paragraph, one or more voice messages, one or more pictures, and the like.
  • the dialogue information newly sent by the user may include declarative sentences (for example, "Hello"), interrogative sentences (for example, "How is the system used?"), and the like.
  • the user may ask a question through the user interface on the terminal, and the processing device 112 may answer the user's question by generating target audio and video.
  • the contextual dialog information may include continuous information sent and received by the user through the terminal prior to sending the latest message.
  • the processing device 112 may analyze the information sent by the user, and send feedback information (such as text, voice, pictures, audio and video, etc.) to the terminal.
  • the user can continue to send new information based on the feedback information.
  • the new information is the latest dialogue information sent by the user, and the contextual dialogue information includes the information previously sent by the user and the above-mentioned feedback information (also called two rounds of continuous dialogue information).
  • the contextual dialogue information may include the latest multi-round dialogue information, or may include all dialogue information.
  • the number of rounds of dialog messages may be preset in the system 100.
  • the processing device 112 may determine dialog feature information for the dialog information.
  • the dialogue feature information may include keywords, emotions, etc. in the dialogue information.
  • the processing device 112 may determine the dialogue feature information from the dialogue information. For example, the processing device 112 may determine the dialogue features from the most recently sent dialogue information. If the dialogue information is text, the processing device 112 may extract keywords from the dialogue information using a keyword extraction technique. Exemplary keyword extraction techniques may include, but are not limited to, topic models, TF-IDF, TextRank, RAKE, and the like. If the dialogue information is speech, the processing device 112 may convert the speech into text through speech recognition technology and then perform dialogue feature extraction on the converted text.
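  • By way of illustration only, the following minimal sketch implements one of the listed techniques (TF-IDF) for keyword extraction. The stopword list, the tokenizer, and the background corpus are placeholder assumptions, not part of the described method:

```python
import math
import re
from collections import Counter

STOPWORDS = {"how", "do", "i", "to", "this", "a", "the", "and", "is"}

def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer with a tiny illustrative stopword list."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_keywords(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Rank the query's tokens by TF-IDF against a background corpus."""
    docs = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(query))
    total = sum(tf.values())
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)          # document frequency
        idf = math.log((1 + len(docs)) / (1 + df)) + 1  # smoothed IDF
        scores[term] = (count / total) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

corpus = [
    "how to reset a password",
    "how to use the charting tool",
    "account settings and profile",
]
print(tfidf_keywords("how do I use this system", corpus))
# -> ['system', 'use']  (terms rare in the corpus score highest)
```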
  • the processing device 112 may identify the dialogue feature information in the picture in the dialogue information through image recognition technology. For example, the processing device 112 may recognize the text in the picture, and identify the keyword through the text content. For another example, the processing device 112 may identify features in the picture, thereby judging the emotion expressed by the picture. For example, if the picture sent by the user is an angry expression, the processing device 112 may extract image features from the picture, so as to determine the emotional content.
  • processing device 112 may determine dialog feature information based on contextual dialog information. By analyzing the contextual dialogue information and performing semantic recognition, the dialogue feature information can be determined. Specifically, the processing device 112 may determine the dialogue feature information through a hierarchical model or a non-hierarchical model.
  • the processing device 112 may obtain user characteristic information of the user.
  • the user feature information may include user personal information, user preference information, other user information (eg, user hobbies), and the like.
  • User personal information may include age, gender, educational background, work background, health status, home address, marital status, etc., or a combination thereof.
  • the user may input user personal information, such as voice input, text input, etc., through the terminal.
  • the processing device 112 may obtain user personal information from the terminal through the network 120 .
  • an APP may be installed on the terminal, and the user interacts with the APP through the terminal and needs to log in to the APP.
  • the processing device 112 may obtain the user ID of the user through the terminal, and obtain the user personal information corresponding to the user ID from the storage device 140 through the network 120 .
  • the user's preference information may include the user's preference settings, the user's current mood, or historical user feedback information provided by the user for historical audio and video in the past.
  • the preference settings may include the user's preference for the playback speed of the target audio and video (e.g., slow, normal, fast), the user's preference for the playback voice of the target audio and video (e.g., female voice, male voice), the user's preference for the playback content of the target audio and video (e.g., concise, detailed), the user's preference for the playback quality of the target audio and video (e.g., Blu-ray, high-definition, standard-definition), etc.
  • the user may provide the user's preference settings at the terminal.
  • the terminal 130 may provide a preference setting selection page for the user to select (eg, provided through the above-mentioned APP).
  • the user's preference settings may be saved in storage device 140 .
  • the processing device 112 may obtain the user preference information corresponding to the user ID from the storage device 140 through the network 120 based on the user ID used by the user to log in to the APP.
  • the user's current emotions may include happiness, liking, sadness, surprise, anger, fear, and disgust. Alternatively, the user's emotions can be categorized as positive, negative, neutral, etc.
  • the processing device 112 may identify the user's current emotion from the dialog information. For example, when the dialog information related to the user is text information, the processing device 112 may identify the current emotion of the user through text sentiment analysis technology. Exemplary textual sentiment analysis techniques may include, but are not limited to, techniques based on keyword extraction rules, techniques based on machine learning models (eg, support vector machines, neural networks, logistic regression, etc.), etc., or combinations thereof.
  • the processing device 112 may obtain a list of emotional keywords (e.g., positive words, negative words, or words expressing anger, happiness, or sadness), extract keywords from the text, and compare them against the emotional keyword list to determine the emotion expressed by the text.
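  • As a hedged illustration, a keyword-list comparison of this kind might look like the following sketch; the keyword sets and the simple overlap count are assumptions made for the example:

```python
# Hypothetical keyword lists; a production list would be curated per language.
EMOTION_KEYWORDS = {
    "anger": {"furious", "annoyed", "ridiculous", "useless"},
    "happiness": {"great", "thanks", "awesome", "helpful"},
    "sadness": {"disappointed", "unfortunately", "sad"},
}

def classify_emotion(text: str) -> str:
    """Return the emotion whose keyword list overlaps the text the most."""
    tokens = set(text.lower().split())
    hits = {emotion: len(tokens & kws) for emotion, kws in EMOTION_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "neutral"

print(classify_emotion("this is useless and I am furious"))  # -> anger
```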
  • the historical user feedback information provided by the user for historical audio and video in the past may include the user's feedback on the content of the audio and video, feedback on the audio and video playback sound, feedback on the audio and video playback speed, feedback on the video playback quality, feedback on the audio and video playback process, etc., or any combination thereof.
  • the processing device 112 may determine the historical user feedback through the user's operations on the historical audio and video. For example, the processing device 112 may determine the user's feedback on the audio and video playback content through operations such as pause, playback, fast-forward, etc. performed by the user on the historical audio and video.
  • the processing device 112 may determine the feedback on the playback speed of the audio and video through the user's operation of adjusting the playback speed of the historical audio and video. For another example, the processing device 112 may determine the feedback on the video playback picture quality through the picture quality adjustment operation performed by the user on the historical video.
  • the processing device 112 may determine at least one target attribute corresponding to the target audio and video based on the dialog feature information and the user feature information.
  • the target attribute may include, but is not limited to, semantic information, level of detail, difficulty of understanding, playback speed, timbre of playback sound, playback picture quality, picture tone, etc., or a combination thereof.
  • the target attribute includes at least semantic information.
  • the semantic information refers to the meaning to be expressed by the feedback information determined by the processing device 112 according to the user's dialogue information (eg, a question).
  • the processing device 112 may determine semantic information from the extracted dialogue feature information.
  • semantic information may be expressed in the form of keywords. For example, the keyword extracted based on the latest dialogue information sent by the user is "hello", and the determined semantic information may be "hello".
  • target attributes may be determined based on user characteristic information through attribute determination rules. Attribute determination rules can be used to determine all target attributes or individual target attributes.
  • attribute determination rules may include comparing individual user characteristic information with preset reference information (e.g., categories or thresholds) to determine target attributes.
  • one or multiple target attributes can be determined from a single item of user feature information, or a single target attribute can be determined from multiple user features.
  • multiple target attributes may be determined by comparing the user's age with age thresholds. If the age is greater than a first age threshold (for example, 60 years old), the level of detail is set to detailed, the difficulty of understanding to simple, the playback speed to slow, the timbre to normal, and the picture tone to a calm tone such as cool colors.
  • if the age of the user is less than a second age threshold (for example, 10 years old), the level of detail is set to detailed, the difficulty of understanding to simple, the playback speed to slow, the timbre to one that children like (such as the voice of a cartoon character), and the picture tone to warm.
  • the difficulty of understanding may be determined by comparing the user's educational background with educational degree categories. If the user's educational level is that of a primary school student, the difficulty of understanding is set lower.
  • the attribute determination rule may include combining multiple user feature information and comparing with preset reference information to determine a single target attribute.
  • weights may be assigned to a plurality of user feature information items, which are then combined in a weighted manner. For example, the three features of user age, user education, and user emotion can be combined with weights of 0.3, 0.4, and 0.3, respectively. If the user is between 18 and 50 years old, the comprehension difficulty score corresponding to age is 0.5 (or moderate). If the education level is high school, the comprehension difficulty score corresponding to education is 0.3 (or low). If the emotion is irritable or impatient, the comprehension difficulty score corresponding to emotion is 0.3 (or low).
  • a numerical value (e.g., the total comprehension difficulty score) or a category may be used to reflect a target attribute. For example, if the total score is low, the level of detail may be set low.
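  • The weighted combination described above can be made concrete with a short worked example. In the following sketch, the weights and per-feature scores come from the example in the text, while the bucket boundaries used to map the total score to a category are illustrative assumptions:

```python
# Weights from the example above: age 0.3, education 0.4, emotion 0.3.
WEIGHTS = {"age": 0.3, "education": 0.4, "emotion": 0.3}

def comprehension_difficulty(scores: dict[str, float]) -> tuple[float, str]:
    """Weighted combination of per-feature difficulty scores, then bucketed.
    The bucket boundaries are illustrative assumptions."""
    total = sum(WEIGHTS[feature] * s for feature, s in scores.items())
    category = "low" if total < 0.4 else "moderate" if total < 0.7 else "high"
    return round(total, 2), category

# Scores from the example: age 18-50 -> 0.5, high school -> 0.3, impatient -> 0.3.
print(comprehension_difficulty({"age": 0.5, "education": 0.3, "emotion": 0.3}))
# -> (0.36, 'low'), i.e. 0.3*0.5 + 0.4*0.3 + 0.3*0.3 = 0.36
```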
  • the processing device 112 may also determine the target attribute based on a machine learning model. Specifically, the processing device 112 may acquire at least one trained target attribute determination model; input at least a part of the dialogue feature information and the user feature information into the at least one trained target attribute determination model; and determine the at least one target attribute based on the output of the at least one trained target attribute determination model.
  • For the related content of using the machine learning model to determine the target attribute, reference may be made to FIG. 6 and its description, which will not be repeated here.
  • the processing device 112 may determine the target audio and video based on at least one target attribute.
  • the user can be provided with personalized target audio and video content, which can provide the user with a better viewing experience, help the user to answer questions, and/or improve the user's human-computer interaction experience.
  • the processing device 112 may retrieve the database from the storage device 140 .
  • the database may include a plurality of materials, such as at least one of one or more candidate audios, one or more candidate videos, one or more candidate texts, and one or more candidate images.
  • Each material in the database has at least one basic attribute, such as semantic information, level of detail, difficulty of understanding, emotional attributes, etc.
  • the processing device 112 may select a target audio/video matching the target attribute from the database based on at least one target attribute.
  • for example, suppose the dialogue information sent by the user is "How do I use this system?"; the dialogue feature information extracted by the processing device 112 may be "system" and "how to use"; the processing device 112 obtains the user's preference settings through the network 120, in which the preference for audio and video playback content is set to detailed; the target audio and video determined by the processing device 112 may then be a detailed version of a video on how to use the system.
  • one or more non-audio-and-video materials matching the target attributes, such as text and pictures, may also be obtained from the database.
  • the processing device 112 may directly send the text or picture matching the target attributes to the terminal to conduct a dialogue with the user.
  • the processing device 112 may also generate the target audio and video based on the above one or more non-audio-and-video materials matching the target attributes.
  • the processing device 112 may generate speech based on the text and synthesize the text, images, and speech into the target video.
  • For related content, reference may be made to FIG. 7 and its description, which will not be repeated here.
  • the processing device 112 may determine the target audio and video based on at least one target attribute according to the machine learning model. For the content of determining the target audio and video according to the machine learning model, reference may be made to FIG. 8 and its description, which will not be repeated here.
  • the target audio and video may be a single video.
  • the target audio and video may include multiple segments arranged in sequence.
  • the above-mentioned multiple segments can be played in sequence.
  • multiple segments may be segments of similar content.
  • the processing device 112 may find a plurality of audios and videos matching the target attributes from the database and sort them according to certain rules to form a complete target audio and video. For example, the processing device 112 may sort the above audios and videos according to their matching degree with the target attributes.
  • the processing device 112 may sort the plurality of audios and videos according to the value of a certain target attribute, for example, in ascending order of understanding difficulty (that is, the understanding difficulty of the plurality of audios and videos increases segment by segment).
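  • A minimal sketch of such ascending-difficulty ordering, assuming each matched segment carries a numeric difficulty value (the field names are illustrative):

```python
def order_segments(segments: list[dict], key: str = "difficulty") -> list[dict]:
    """Arrange matched segments so that the chosen attribute ascends."""
    return sorted(segments, key=lambda seg: seg[key])

clips = [
    {"name": "advanced_usage", "difficulty": 3},
    {"name": "intro", "difficulty": 1},
    {"name": "core_features", "difficulty": 2},
]
print([c["name"] for c in order_segments(clips)])
# -> ['intro', 'core_features', 'advanced_usage']
```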
  • the processing device 112 may directly generate the target audio and video including multiple segments according to a machine learning model (eg, the material determination model described in FIG. 8 ).
  • the processing device 112 may determine whether the user provides user feedback during the playback of the target audio and video.
  • the user feedback provided by the user may include the number of pauses, the duration of the pause, the number of playbacks, the playback duration, the number of fast forwards, the duration of fast forwards, the number of slow playbacks, the slow playback duration, whether to propose a new question, whether to end playback early, etc., or a combination thereof.
  • These user feedbacks may indicate the user's ability to understand and absorb the target audio and video, and/or the user's level of acceptance (e.g., like or dislike) of the target audio and video.
  • the processing device 112 may determine whether the user provides user feedback through the operation performed by the user on the terminal 130 on the target audio and video.
  • the terminal 130 can determine the pause operation, playback operation, fast-forward operation, slow-play operation, close operation, and the operation of sending a new message performed by the user on the target audio and video, and then transmit these operations to the processing device 112 through the network 120.
  • the terminal 130 may detect the user's operation of some buttons on the user interface and/or the user's gesture operation on the user interface.
  • the terminal 130 can detect that the user clicks the pause, playback, fast-forward, slow-play, or close button on the target audio and video, and accordingly determine the pause operation, playback operation, fast-forward operation, slow-play operation, or close operation.
  • if the number of times the user fast-forwards the target audio and video is greater than a certain threshold (for example, 3 times), this may reflect that the target audio and video content is too simple and easy for the user to understand, that the user does not like the content of the target audio and video, or that the user prefers a faster playback speed.
  • the processing device 112 may, in response to the user providing user feedback during the playback of the target audio and video, determine, based on the user feedback, whether it is necessary to adjust the basic attributes of at least one unplayed segment in one or more segments of the target audio and video. By determining whether the user provides user feedback, the processing device 112 can further optimize the content of the target audio and video to provide the user with a better viewing effect and improve the user experience.
  • the base attribute of the unplayed segment may correspond to the target attribute.
  • Basic attributes may include only some of the target attributes (e.g., level of detail) or all of the target attributes.
  • the basic attributes may include semantic information, level of detail, difficulty of understanding, playback speed, playback sound, playback picture quality, picture tone, etc., or a combination thereof.
  • the processing device 112 may adjust basic properties of at least one unplayed segment in one or more segments of the target audio and video according to the user feedback.
  • the adjusted base attributes may be determined through attribute adjustment rules. Attribute adjustment rules may be similar to attribute determination rules.
• the attribute adjustment rules may include determining whether a single type of feedback in the user feedback is greater than a threshold (e.g., 3).
• If so, the playback speed of at least one unplayed segment of the target audio and video may be adjusted, for example, to 1.5 times the original speed, as sketched below.
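• As a minimal, illustrative Python sketch of such an adjustment rule (the feedback fields, the threshold of 3, and the 1.5x factor are assumptions for illustration, not a definitive implementation):

```python
FAST_FORWARD_THRESHOLD = 3  # hypothetical rule: > 3 fast-forwards triggers a change


def adjust_unplayed_segments(segments, feedback):
    """Adjust basic attributes of unplayed segments based on user feedback.

    segments: list of dicts such as {"played": False, "playback_speed": 1.0}
    feedback: dict such as {"fast_forward_count": 4}
    """
    if feedback.get("fast_forward_count", 0) > FAST_FORWARD_THRESHOLD:
        for seg in segments:
            if not seg["played"]:
                seg["playback_speed"] = 1.5  # speed up the remaining segments
    return segments


segments = [{"played": True, "playback_speed": 1.0},
            {"played": False, "playback_speed": 1.0}]
print(adjust_unplayed_segments(segments, {"fast_forward_count": 4}))
# -> the unplayed segment's playback_speed becomes 1.5
```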
  • process 500 may further include storing the target audio and video in storage device 140 .
  • operations 560-570 in process 500 may be omitted.
  • FIG. 6 is a flowchart of determining at least one target attribute corresponding to target audio and video according to some embodiments of the present application.
• one or more steps of process 600 may be performed to obtain the at least one target attribute described in step 540 of FIG. 5.
• process 600 may be implemented by a set of instructions (e.g., an application program).
• the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when the instructions are executed, the processor 220 and/or the modules may be configured to perform process 600.
• the operations of the process shown below are for illustration purposes only.
• process 600 may be accomplished using one or more additional operations not described and/or without one or more of the operations discussed herein. Additionally, the order of the process operations shown in FIG. 6 and described below is not limiting.
  • the processing device 112 may acquire at least one trained target attribute determination model.
  • the target attribute determination model may be a model for determining at least one target attribute corresponding to the target audio and video, such as a machine learning model.
• the at least one target attribute determination model may include a deep learning model, e.g., a deep neural network (DNN) model, a convolutional neural network (CNN) model, a recurrent neural network (RNN) model, a feature pyramid network (FPN) model, a Seq2Seq model, a long short-term memory (LSTM) model, etc.
  • the target attribute determination model may receive model inputs (eg, dialog feature information, user profile information, and/or other user-related information), and the target attribute determination model may output at least one target attribute information.
  • the target attribute determination model may output a sequence of audio and video attribute information.
  • the audio-video attribute information sequence may include multiple groups of attribute information arranged in sequence, wherein each group of the multiple sets of attribute information corresponds to a segment of the target audio-video.
• the processing device 112 may obtain the at least one trained target attribute determination model from one or more components of the system 100 (e.g., the storage device 140, the terminal 130), or from a third-party system (e.g., a database system of a supplier of the target attribute determination model).
  • the at least one target attribute determination model may be pre-trained by a computing device (eg, processing device 112) and stored in memory of system 100 (eg, storage device 140, memory 220, and/or memory 390).
  • the processing device 112 can access the memory and retrieve the at least one target attribute determination model.
  • the at least one target attribute determination model may be generated according to one or more machine learning algorithms.
• the one or more machine learning algorithms may include, but are not limited to, artificial neural network algorithms, deep learning algorithms, decision tree algorithms, association rule algorithms, inductive logic programming algorithms, support vector machine algorithms, clustering algorithms, Bayesian network algorithms, reinforcement learning algorithms, representation learning algorithms, similarity metric learning algorithms, sparse dictionary learning algorithms, genetic algorithms, rule-based machine learning algorithms, etc., or any combination thereof.
  • the processing device 112 or another computing device may train the target attribute determination model according to a supervised learning algorithm.
  • the processing device 112 may acquire one or more first training samples and a first initial model.
  • Each first training sample may include sample dialog feature information of the sample user, sample user feature information, and sample attribute information or sample attribute information sequence of sample audio and video played to the user.
  • the first initial model to be trained may include one or more model parameters, such as the number of layers, the number of nodes, the first loss function, etc., or any combination thereof.
  • the first initial model may have initial parameter values for one or more model parameters prior to training.
• the training of the first initial model may include one or more first iterative processes to iteratively update the model parameters of the first initial model based on the one or more first training samples, until a first termination condition is satisfied in a certain iteration.
• Exemplary first termination conditions may include: the value of the first loss function obtained in a certain iteration is less than a threshold; a certain number of iterations have been performed; or the first loss function converges, such that the difference between the value of the first loss function obtained in the previous iteration and that obtained in the current iteration is within a certain threshold range; and the like.
• the first loss function can be used to measure, in an iteration, the difference between the audio and video attribute information predicted by the first initial model and the sample audio and video attribute information, or between the predicted audio and video attribute information sequence and the sample audio and video attribute information sequence.
  • the sample dialogue feature information and sample user feature information of the sample users of each first training sample can be input into the first initial model, and the first initial model can output the predicted audio and video of the first training sample Attribute information or a sequence of predicted audio and video attribute information.
  • the first loss function may be used to measure the difference between the predicted audio and video attribute information of each first training sample and the sample audio and video attribute information, or between the predicted audio and video attribute information sequence and the sample audio and video attribute information sequence.
• Exemplary first loss functions may include a focal loss function, a logarithmic loss function, a cross-entropy loss function, and the like. If the first termination condition is not met in the current iteration, the processing device 112 may further update the first initial model for the next iteration according to a machine learning algorithm (e.g., a backpropagation algorithm). If the first termination condition is satisfied in the current iteration, the processing device 112 may designate the first initial model in the current iteration as the target attribute determination model. A minimal sketch of this training loop follows.
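• A minimal training-loop sketch of the supervised procedure above, assuming PyTorch with toy tensors (the model shape, loss threshold, and convergence tolerance are illustrative assumptions, not the application's actual model):

```python
import torch
from torch import nn, optim

# Toy first training samples: input features (dialogue + user features)
# and integer labels standing in for sample audio and video attribute values.
x = torch.randn(32, 16)
y = torch.randint(0, 3, (32,))

model = nn.Linear(16, 3)          # stand-in for the first initial model
loss_fn = nn.CrossEntropyLoss()   # stand-in for the first loss function
opt = optim.SGD(model.parameters(), lr=0.1)

prev_loss, max_iters, eps = None, 200, 1e-4
for i in range(max_iters):        # first iterative process
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()               # update via backpropagation
    opt.step()
    # First termination condition: loss below a threshold, or convergence
    # (change between consecutive iterations within a tolerance).
    if loss.item() < 0.05 or (prev_loss is not None
                              and abs(prev_loss - loss.item()) < eps):
        break
    prev_loss = loss.item()
```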
  • the processing device 112 may input the dialog feature information and at least a portion of the user feature information into the at least one trained target attribute determination model.
• the processing device 112 may input the dialogue feature information determined in step 520 and the user feature information determined in step 530 into a target attribute determination model, and the target attribute determination model may output all the audio and video attribute information or a sequence of audio and video attribute information.
  • the processing device 112 may input the dialog feature information determined in step 520 and the user feature information determined in step 530 into a plurality of target attribute determination models, each target in the plurality of target attribute determination models The attribute determination model can output corresponding one or more items of audio and video attribute information, or one or more items of audio and video attribute information sequences.
• the processing device may input the dialogue feature information determined in step 520 and the user feature information determined in step 530 into the first target attribute determination model and the second target attribute determination model, respectively; the first target attribute determination model may output one or more corresponding pieces of audio and video attribute information or an attribute information sequence (for example, semantic information, level of detail, difficulty of understanding), and the second target attribute determination model may output the remaining audio and video attribute information or attribute information sequence.
• the processing device may preprocess at least a part of the dialogue feature information and the user feature information, generate a corresponding model input feature sequence, and input the model input feature sequence into the at least one target attribute determination model to obtain audio and video attribute information or an audio and video attribute information sequence.
• the preprocessing operations of the processing device 112 may include removing special characters irrelevant to sentence semantics, normalizing some non-critical information in the dialogue and mapping it into unified characters, and the like, as sketched below.
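• A minimal preprocessing sketch, assuming simple regex rules (the patterns, placeholder tags, and place-name list are illustrative assumptions):

```python
import re

LINK_RE = re.compile(r"https?://\S+")    # links are non-critical information
SPECIAL_RE = re.compile(r"[~^#*@$%&]+")  # characters irrelevant to semantics


def preprocess(utterance, place_names=("Hangzhou", "Beijing")):
    utterance = LINK_RE.sub("<LINK>", utterance)  # map links to a unified character
    for name in place_names:                      # map place names likewise
        utterance = utterance.replace(name, "<PLACE>")
    return SPECIAL_RE.sub("", utterance).strip()


print(preprocess("How to use the system? see https://example.com #help"))
# -> "How to use the system? see <LINK> help"
```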
  • the processing device 112 may determine the at least one target attribute based on the output of the at least one trained target attribute determination model.
  • the output of the at least one target attribute determination model may be one or more items of audio and video attribute information or a sequence of audio and video attribute information.
  • the processing device 112 may obtain the output of the at least one target attribute determination model, and determine the at least one target attribute based on the obtained model output. For example, the processing device 112 may sort the acquired one or more pieces of audio and video attribute information, for example, according to importance. According to actual needs, the processing device 112 may further select one or more pieces of audio and video attribute information (eg, one or more items of audio and video attribute information at the top of the arrangement) as the at least one target attribute according to the arrangement order.
  • processing device 112 may transmit at least a portion of the dialog characteristic information determined in step 520 and the user characteristic information determined in step 530 to another computing device (eg, a computing device of the supplier of the target attribute determination model).
  • the computing device may generate one or more items of audio and video attribute information based on the acquired dialogue feature information and user feature information, and send the generated one or more items of audio and video attribute information to the processing device 112 .
  • the processing device 112 may determine at least one target attribute based on the received one or more pieces of audiovisual attribute information.
  • process 600 may further include a step of obtaining the output of the model, or one or more steps of storing (eg, storing the input and output results of the model).
  • FIG. 7 is a flowchart of determining target audio and video based on a database according to some embodiments of the present application.
• process 700 may be implemented by a set of instructions (e.g., an application program).
• the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when the instructions are executed, the processor 220 and/or the modules may be configured to perform process 700.
• the operations of the process shown below are for illustration purposes only. In some embodiments, process 700 may be accomplished using one or more additional operations not described and/or without one or more of the operations discussed herein. Additionally, the order of the process operations shown in FIG. 7 and described below is not limiting.
  • the processing device 112 may acquire the database.
  • the database may be a database pre-produced and stored by the system 100 .
  • the database may also be a database obtained from an external source, for example, a database obtained from an external storage device based on the network 120.
  • the database may include a visual database, an audio database, a textual database, a picture database, etc., or a combination thereof.
  • the database can be used to provide candidate content of target audio and video.
  • the database can include multiple materials, such as text, images, audio, and video.
  • the plurality of materials may include one or more candidate audios, one or more candidate videos, one or more candidate texts, one or more candidate images, etc., or combinations thereof.
  • the candidate content in the plurality of materials may be determined by the processing device 112 through dialog characteristic information associated with the user. For example, the processing device 112 may determine candidate content based on keywords in the dialogue information. For example only, if the keyword is "system, how to use", the plurality of materials may include a plurality of candidate contents related to the use of the system.
  • the processing device 112 may determine the target audio and video based on the at least one target attribute and the database.
  • the processing device 112 may determine the target audio and video by matching each of the at least one target attribute with the material in the database, and selecting the target material based on the matching result.
  • the selected target material can be audio and video, and can be directly designated as the target audio and video.
  • the processing device may further adjust at least part of the basic properties of the target material in the form of audio and video according to the target properties to generate the target video, such as adjusting playback speed, playback sound, picture tone, and the like.
  • the processing device 112 may generate a new video as the target audio and video based on the target material.
  • the processing device 112 may generate target audio and video based on one or more pieces of target text in the target material.
  • the processing device 112 may acquire a text sequence of the above-mentioned one or more target texts, and then generate a corresponding speech sequence based on the text sequence as the target audio.
  • the processing device 112 may further generate a target video based on the target audio and one or more target pictures in the target material.
• the processing device 112 may also determine the target audio and video based on various combinations of the above target materials, for example, generating the target video based on a plurality of target pictures and a target audio found in the database, as sketched below.
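• As a minimal sketch of this combination, assuming the gTTS and moviepy (1.x) packages are available (the file names, texts, and durations are placeholders, not materials from the application):

```python
from gtts import gTTS
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

target_texts = ["Open the settings page.", "Click the start button."]
target_pictures = ["step1.png", "step2.png"]  # hypothetical target images

# 1. Generate the target audio from the text sequence.
gTTS(text=" ".join(target_texts)).save("target_audio.mp3")
audio = AudioFileClip("target_audio.mp3")

# 2. Show each target picture for an equal share of the audio duration.
per_image = audio.duration / len(target_pictures)
clips = [ImageClip(p).set_duration(per_image) for p in target_pictures]

# 3. Combine the pictures and the audio into the target video.
video = concatenate_videoclips(clips).set_audio(audio)
video.write_videofile("target_video.mp4", fps=24)
```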
  • the processing device 112 may calculate a degree of matching of the basic attribute of each of the plurality of materials in the database with each target attribute.
  • the degree of match can be expressed as a number (eg, 1-10) or a rating (eg, high, medium, low).
• For example, for a candidate text, the processing device 112 may calculate the degree of matching between the basic attributes of the candidate text and each target attribute.
  • the base attributes and target attributes may be represented by numerical values.
• the processor may compare the value of the basic attribute of the material with the value of the corresponding target attribute (e.g., by computing their difference or ratio).
  • the processing device 112 may calculate the matching degree of the semantic information in the basic attribute of the candidate text with the semantic information in the target attribute, the matching degree of the detailed level in the basic attribute of the candidate text with the detailed level in the target attribute, and the like.
  • the processing device 112 may determine one or more matching scores based on the calculated degrees of matching of the basic attributes of each material with the target attributes.
  • the match score of a base attribute and a target attribute may be positively correlated (or negatively correlated) with the degree of match.
  • Match scores can be in numerical (eg, percentage) form, eg, 30%, 60%.
  • processing device 112 may select one or more target materials from the plurality of materials based on one or more match scores for each of the plurality of materials. For example, the processing device 112 may sum the matching scores of each material corresponding to the target attribute to obtain a total matching score, and then sort each material according to the total matching score.
  • the processing device 112 may average the matching scores of each material corresponding to the target attribute, and sort each material according to the average. For another example, different weight values can be assigned to multiple target attributes, each matching score is multiplied by the weight value of the corresponding target attribute, and then summed to obtain a total matching score, and each material is sorted according to the total matching score.
• the processor 112 may further select the corresponding target materials (e.g., the top 20%, or the top three materials) according to the sorting result; a minimal scoring-and-ranking sketch follows.
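• A minimal scoring-and-ranking sketch of the weighted matching described above (the attribute values, weights, and materials are illustrative assumptions):

```python
def match_score(basic, target):
    """Return a score in [0, 1]; 1 means the basic attribute matches the target."""
    if isinstance(target, (int, float)):
        return 1.0 - min(abs(basic - target) / max(abs(target), 1e-9), 1.0)
    return 1.0 if basic == target else 0.0


def rank_materials(materials, target_attrs, weights):
    scored = []
    for m in materials:
        total = sum(weights[a] * match_score(m["attrs"][a], v)
                    for a, v in target_attrs.items())  # weighted total score
        scored.append((total, m["name"]))
    return sorted(scored, reverse=True)  # highest total match score first


materials = [
    {"name": "video_A", "attrs": {"detail": "detailed", "speed": 1.0}},
    {"name": "video_B", "attrs": {"detail": "concise", "speed": 1.5}},
]
print(rank_materials(materials,
                     target_attrs={"detail": "detailed", "speed": 1.0},
                     weights={"detail": 0.6, "speed": 0.4}))
# -> [(1.0, 'video_A'), (0.2, 'video_B')]; video_A would be selected
```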
  • the processing device 112 may determine the target audio and video by adjusting the basic properties of the one or more target materials based on one or more target materials and at least one target attribute.
  • the one or more target materials may include one or more initial audio and video.
  • the processing device 112 can adjust at least some basic attributes of the initial audio and video according to at least one target attribute, for example, can adjust the playback speed, picture tone, sound tone and the like of the initial audio and video.
  • process 700 is provided for illustration purposes only, and is not intended to limit the scope of the present application. Numerous changes and modifications will occur to those of ordinary skill in the art under the teachings of this application. However, those changes and modifications do not depart from the scope of this application. In some embodiments, process 700 may be accomplished using one or more additional operations not described and/or one or more operations not discussed herein. Additionally or alternatively, the order of operations of the process 700 shown in FIG. 7 is not limiting.
  • FIG. 8 is another flowchart of determining target audio and video according to some embodiments of the present application.
• process 800 may be implemented by a set of instructions (e.g., an application program).
  • the processor 220 and/or the modules in FIG. 4 may execute a set of instructions, and when the instructions are executed, the processor 220 and/or the modules may be configured to perform the process 800 .
• the operations of the process shown below are for illustration purposes only.
• process 800 may be accomplished using one or more additional operations not described and/or without one or more of the operations discussed herein. Additionally, the order of the process operations shown in FIG. 8 and described below is not limiting.
  • the processing device 112 may acquire the trained material determination model.
  • the material determination model may be a model for generating target material related to target audio and video, such as a machine learning model.
• the material determination model may include a deep learning model, e.g., a deep neural network (DNN) model, a convolutional neural network (CNN) model, a recurrent neural network (RNN) model, a feature pyramid network (FPN) model, a Seq2Seq model, a long short-term memory (LSTM) model, etc.
• the material determination model may receive model input (e.g., dialogue feature information and/or other user-related information), and the material determination model may output one or more target materials related to the target audio and video.
  • the one or more target materials related to the target audio and video may include one or more audios, one or more videos, one or more words, one or more images, and the like.
• the processing device 112 may obtain the trained material determination model from one or more components of the system 100 (e.g., the storage device 140, the terminal 130), or from a third-party system (e.g., a database system of a supplier of the material determination model).
  • the material determination model may be pre-trained by a computing device (eg, processing device 112) and stored in memory of system 100 (eg, storage device 140, memory 220, and/or memory 390).
  • Processing device 112 may access the memory and retrieve the material determination model.
  • the material determination model may be generated according to one or more machine learning algorithms described elsewhere in this application (eg, step 610 in FIG. 6 and its associated description).
  • the processing device 112 may train the material determination model according to a supervised learning algorithm.
  • the processing device 112 may acquire one or more second training samples and a second initial model.
  • Each second training sample may include sample dialogue feature information of the sample user and one or more sample audio and video materials.
  • the second initial model to be trained may include one or more model parameters, such as the number of layers, the number of nodes, the second loss function, etc., or any combination thereof.
  • the second initial model may have initial parameter values for one or more model parameters prior to training.
• the training of the second initial model may include one or more second iterative processes to iteratively update the model parameters of the second initial model based on the one or more second training samples, until a second termination condition is satisfied in a certain iteration.
• Exemplary second termination conditions may include: the value of the second loss function obtained in a certain iteration is less than a threshold; a certain number of iterations have been performed; or the second loss function converges, such that the difference between the value of the second loss function obtained in the previous iteration and that obtained in the current iteration is within a certain threshold range; and the like.
  • the second loss function may be used to measure the difference between the one or more audiovisual materials predicted by the second initial model and the corresponding sample audiovisual materials in an iterative process.
  • the sample dialog feature information of the sample users of each second training sample may be input into the second initial model, and the second initial model may output one or more predicted audio and video materials of the training sample.
  • the second loss function may be used to measure the difference between the one or more predicted audio and video materials for each training sample and the corresponding sample audio and video materials.
  • Exemplary second loss functions may include focal loss functions, logarithmic loss functions, cross-entropy losses, and the like.
• If the second termination condition is not met in the current iteration, the processing device 112 may further update the second initial model for the next iteration according to a machine learning algorithm (e.g., a backpropagation algorithm). If the second termination condition is satisfied in the current iteration, the processing device 112 may designate the second initial model in the current iteration as the material determination model.
  • processing device 112 may input the dialog feature information into the material determination model.
  • the processing device 112 may directly input the dialogue feature information determined in step 520 into a material determination model, and the material determination model may output one or more audio and video materials.
• the processing device 112 may encode the dialogue feature information determined in step 520 to generate a dialogue feature information sequence, and input the dialogue feature sequence into the material determination model, which may output a corresponding sequence of audio and video materials.
• the processing device may preprocess at least a part of the dialogue feature information, and input the preprocessed dialogue feature information into the material determination model to obtain one or more audio and video materials or an audio and video material sequence.
• the processing device 112 may perform one or more preprocessing operations, for example, generating a corresponding model input sequence from the dialogue feature information, removing special characters irrelevant to sentence semantics, and normalizing information such as links and place names in the dialogue and mapping it into unified characters.
  • processing device 112 may determine an initial audiovisual based on the output of the material determination model.
  • the output of the material determination model may be one or more audio and video materials or sequences of audio and video materials.
• the processing device 112 may obtain the output of the material determination model, and determine the initial audio and video based on the obtained model output. For example, if the material determination model directly outputs a complete audio and video, the processing device 112 designates that audio and video as the initial audio and video. For another example, if the material determination model outputs two or more audio and video clips, or a video sequence composed of multiple audio and video clips, the processing device 112 may splice the clips in a certain order to generate the initial audio and video.
• if the output of the material determination model includes one or more pictures and corresponding one or more audios, the processing device 112 may combine them to generate the initial audio and video.
• likewise, the processing device 112 may combine one or more pictures and corresponding one or more pieces of text to generate the initial audio and video.
  • the output of the material determination model may be a piece of text, and the processor 112 may convert the text into audio, and generate a video in combination with a picture containing an avatar to simulate a video conversation between the avatar and the user. This method is beneficial to improve the user's interest in dialogue and bring a good user experience to the user. For example, when the user is a child, the processor 112 may generate a video containing cartoon characters simulating a video conversation with the user.
  • processing device 112 may transmit at least a portion of the dialog characteristic information determined in step 520 to another computing device (eg, a computing device of a supplier of the material determination model).
  • the computing device may generate one or more audio and video materials based on the acquired dialogue feature information, and send the generated one or more audio and video materials to the processing device 112 .
  • the processing device 112 may determine the initial audio and video based on the received one or more audio and video materials.
  • the processing device 112 may generate the target audio and video by adjusting the basic properties of the initial audio and video based on the at least one target attribute.
  • the processing device 112 may refer to the at least one target attribute to determine the basic attribute of the initial audio and video that needs to be adjusted.
  • the processing device 112 may further determine the adjustable range of the basic property that needs to be adjusted.
  • the processing device 112 may further adjust the basic attribute to be adjusted based on the adjustable range of the basic attribute to be adjusted and the corresponding target attribute, so that the adjusted basic attribute is consistent with or similar to the corresponding target attribute.
  • the adjustable range of the basic attribute means that the basic attribute can be adjusted within a certain range.
• For example, suppose the at least one target attribute includes a detailed level of detail, a simple difficulty of understanding, a slow playback speed, and a specified playback sound, while the basic attributes of the initial audio and video include a detailed level of detail, easy-to-understand content, a fast playback speed, a child's voice as the playback sound, and a warm picture tone.
• In this case, the processing device 112 may determine that the basic attributes to be adjusted include the playback speed and the playback sound, and adjust them accordingly.
• the processing device 112 may designate the adjusted initial audio and video as the target audio and video. If the basic attributes of the initial audio and video already match each of the at least one target attribute, the processing device 112 may directly designate the initial audio and video as the target audio and video. A minimal sketch of a range-constrained adjustment follows.
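• A minimal sketch of adjusting a basic attribute toward its target within the adjustable range (the range and values are illustrative assumptions):

```python
ADJUSTABLE_RANGES = {"playback_speed": (0.5, 2.0)}  # assumed adjustable range


def adjust_attribute(name, current, target):
    """Move a basic attribute toward the target within its adjustable range."""
    low, high = ADJUSTABLE_RANGES.get(name, (current, current))
    # If the target lies inside the range, use it; otherwise clamp to the
    # nearest boundary so the result is as close to the target as possible.
    return min(max(target, low), high)


print(adjust_attribute("playback_speed", current=2.5, target=0.8))  # -> 0.8
print(adjust_attribute("playback_speed", current=1.0, target=3.0))  # -> 2.0
```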
  • process 800 may further include a sending step of sending the target audio and video to the target terminal, or one or more storage steps (eg, storing the initial audio and video and the target audio and video).
  • FIG. 9 is a schematic diagram of interaction between a terminal and a server according to some embodiments of the present application.
• the interactive process 900 and the exemplary steps shown in FIG. 9 may be implemented by a set of instructions (e.g., an application program).
  • the processor 220 and/or the modules in FIG. 4 may execute a set of instructions, and when the instructions are executed, the processor 220 and/or the modules may be configured to perform the interactive process 900 .
• the operations of the process shown below are for illustration purposes only.
  • the interactive process 900 may be accomplished using one or more additional operations not described and/or one or more operations not discussed herein.
  • the interaction process 900 is only used as an example to illustrate the entire application process of the human-computer interaction of the present application, and is not intended to limit the present application.
• the interaction process 900 can be applied to various application scenarios of intelligent human-machine dialogue, including but not limited to a user communicating with intelligent customer service robots, smart speakers, chat robots, smart home devices (e.g., smart TVs, smart air conditioners, and smart sweeping/mopping robots), smart vehicles, and web pages or apps on the terminal.
  • a user may interact with the server 110 (or the processing device 112 in the server 110) through the terminal 130 (eg, a user interface on the terminal 130).
  • the user may ask a question through the user interface on the terminal 130, the processing device 112 may answer the user's question by generating the target audio and video, and the user may also provide user feedback on the target audio and video, thereby optimizing the target audio and video content, better serve users.
  • steps 901 , 905 , 908 , 909 , and 9011 are executed by the terminal 130
  • steps 902 , 903 , 904 , 906 , 907 , 9010 , 9012 , and 9013 are executed by the server 110 .
• the terminal 130 may receive the user-related dialogue information (or simply, dialogue information) input by the user.
  • “User-related dialog information” refers to the dialog information sent by the user through the terminal and/or the dialog information received by the user through the terminal.
  • the dialogue information sent by the user includes, but is not limited to, the form of voice, text, and pictures.
• after receiving the dialogue information, the terminal 130 may transmit it to the server 110 (e.g., the processing device 112) through the network 120, so that the server obtains the dialogue information.
• Taking the terminal 130 being an intelligent customer service robot as an example, the user can enter user-related dialogue information through the user interface of the intelligent customer service robot to conduct a dialogue with it.
  • the processing device 112 may acquire the latest dialog information and/or contextual dialog information sent by the user from the terminal 130 .
  • the latest dialogue information sent by the user may include one or more words or words, a sentence, a paragraph, one or more voice messages, one or more pictures, and the like.
• the dialogue information newly sent by the user may include declarative sentences (for example, "Hello"), interrogative sentences (for example, "How is the system used?"), and the like.
  • the contextual dialog information may include continuous information sent and received by the user through the terminal prior to sending the latest message.
  • the processing device 112 may analyze the information sent by the user, and send feedback information (such as text, voice, pictures, audio and video, etc.) to the terminal.
  • the user can continue to send new information based on the feedback information.
  • the new information is the latest dialogue information sent by the user
  • the contextual dialogue information includes the information previously sent by the user and the above-mentioned feedback information (also called two-round continuous dialogue information).
  • the contextual dialogue information may include the latest multi-round dialogue information, or may include all dialogue information.
  • the number of rounds of dialog information may be preset in the system 100 .
  • the processing device 112 may determine dialog characteristic information of the dialog information.
  • the dialogue feature information may include keywords, emotions, etc. in the dialogue information.
  • the processing device 112 may determine the dialog characteristic information from the dialog information. For example, the processing device 112 may determine the conversation characteristics from the most recently issued conversation information.
• the processing device 112 may extract keywords from the dialogue information according to a keyword extraction technique. Exemplary keyword extraction techniques may include, but are not limited to, topic models, TF-IDF, TextRank, RAKE, and the like; a minimal sketch follows.
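• A minimal TF-IDF keyword-extraction sketch, assuming scikit-learn is available (the corpus and the number of keywords are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "how to use this system",               # user's latest dialogue message
    "how to reset my password",
    "the system playback speed is too fast",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(corpus)

# Top-2 keywords of the latest message (row 0) by TF-IDF weight.
row = tfidf[0].toarray().ravel()
terms = vec.get_feature_names_out()
top = row.argsort()[::-1][:2]
print([terms[i] for i in top])  # e.g., ['use', 'this']
```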
  • the processing device 112 may convert the speech information into text through speech recognition technology, and then perform dialogue feature information extraction on the converted text.
  • the processing device 112 may identify the dialogue feature information in the picture in the dialogue information through image recognition technology. For example, the processing device 112 may recognize the text in the picture, and identify the keyword through the text content. For another example, the processing device 112 may identify features in the picture, thereby judging the emotion expressed by the picture. For example, if the picture sent by the user is an angry expression, the processing device 112 may extract image features from the picture, so as to determine the emotional content.
  • processing device 112 may determine dialog feature information based on contextual dialog information. By analyzing the contextual dialogue information and performing semantic recognition, the dialogue feature information can be determined. Specifically, the processing device 112 may determine the dialogue feature information through a hierarchical model or a non-hierarchical model.
  • the user may also input user characteristic information through the terminal.
  • the user feature information may include user personal information, user preference information, other user information (eg, user hobbies), and the like.
• User personal information may include age, gender, educational background, work background, health status, home address, marital status, etc., or a combination thereof.
  • the user may input via voice input, text input, or the like.
• the terminal may provide a form for the user to fill in their personal information.
  • the processing device 112 may obtain user personal information from the terminal through the network 120 .
  • an APP may be installed on the terminal, and the user interacts with the APP through the terminal and needs to log in to the APP.
  • the processing device 112 may obtain the user ID of the user through the terminal, and obtain the user personal information corresponding to the user ID from the storage device 140 through the network 120 .
  • the user's preference information may include the user's preference settings, the user's current mood, or historical user feedback information provided by the user for historical audio and video in the past.
• the preference settings may include the user's preference for the playback speed of the target audio and video (e.g., slow, normal, fast), for the playback sound (e.g., female voice, male voice), for the playback content (e.g., concise, detailed), for the playback quality (e.g., Blu-ray, high-definition, standard-definition), and the like.
  • the user may provide the user's preference settings at the terminal, for example, the terminal 130 may provide a preference setting selection page for the user to select (for example, provided by the above-mentioned APP).
  • the user's preference settings may be saved in storage device 140 .
  • the processor 112 may obtain the user preference information corresponding to the user ID from the storage device 140 through the network 120 based on the user ID used by the user to log in to the APP.
  • the user's current emotions may include happiness, liking, sadness, surprise, anger, fear, and disgust. Alternatively, the user's emotions can be categorized as positive, negative, neutral, etc.
  • the processing device 112 may identify the user's current emotion from the dialog information. For example, when the dialog information related to the user is text information, the processing device 112 may identify the current emotion of the user through text sentiment analysis technology. Exemplary textual sentiment analysis techniques may include, but are not limited to, keyword extraction rule-based techniques, machine learning model-based techniques, and the like, or combinations thereof.
• the processing device 112 may obtain a list of emotional keywords (e.g., positive words, negative words, or words expressing anger, happiness, or sadness), extract the emotional keywords from the text, and compare them with the emotional keyword list to determine the sentiment expressed by the text, as sketched below.
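• A minimal lexicon-based sketch of this comparison (the keyword lists are illustrative, not the application's actual emotional keyword list):

```python
EMOTION_LEXICON = {
    "anger": {"angry", "furious", "annoyed"},
    "happiness": {"great", "thanks", "happy"},
    "sadness": {"sad", "disappointed"},
}


def detect_emotion(text):
    words = set(text.lower().split())
    # Count lexicon hits per emotion and return the best match, if any.
    hits = {emo: len(words & kws) for emo, kws in EMOTION_LEXICON.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "neutral"


print(detect_emotion("I am really angry about this system"))  # -> "anger"
```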
  • the historical user feedback information provided by the user for historical audio and video in the past may include the user's feedback on the content of the audio and video, the feedback on the audio and video playback sound, the feedback on the audio and video playback speed, the feedback on the video playback quality, the feedback on the audio and video playback Feedback of the playback process, etc., or any combination thereof.
  • the terminal 130 may store the user's operations on the historical audio and video, and the processing device 112 may obtain the user's operation on the historical audio and video by accessing the terminal 130, thereby judging the historical user feedback.
• the terminal 130 may store operations such as pause, playback, and fast-forward performed by the user on the historical audio and video.
• the processing device 112 may determine the user's feedback on the audio and video content by acquiring these pause, playback, and fast-forward operations.
  • the terminal 130 may store the user's operation of adjusting the playback speed of the historical audio and video, and the processing device 112 may obtain the user's operation of adjusting the playback speed of the historical audio and video to determine the feedback on the audio and video playback speed.
• the terminal 130 may store the user's operation of adjusting the image quality of the historical video, and the processing device 112 may determine the feedback on the video playback image quality through this image quality adjustment operation.
  • the processing device 112 may determine at least one target attribute corresponding to the target audio and video based on the dialogue feature information and the user feature information.
  • the target attribute may include, but is not limited to, semantic information, level of detail, difficulty of understanding, playback speed, timbre of playback sound, playback picture quality, picture tone, etc., or a combination thereof.
  • the target attribute includes at least semantic information.
  • the semantic information refers to the meaning to be expressed by the feedback information determined by the processing device 112 according to the user's dialogue information (eg, a question).
  • the processing device 112 may determine semantic information from the extracted dialogue feature information.
  • semantic information may be expressed in the form of keywords. For example, the keyword extracted based on the latest dialogue information sent by the user is "hello", and the determined semantic information may be "hello".
  • target attributes may be determined based on user characteristic information through attribute determination rules. Attribute determination rules can be used to determine all target attributes or individual target attributes.
  • attribute determination rules may include comparing individual user characteristic information with preset reference information (eg, categories or thresholds) to determine target attributes.
  • Single or multiple target attributes can be determined from a single user characteristic information, or a single target attribute can be determined from multiple user characteristics.
• multiple target attributes can be determined by comparing the user's age with an age threshold. If the age is greater than a first age threshold (e.g., 60 years old), the level of detail is detailed, the difficulty of understanding is simple, the playback speed is slow, the timbre is normal, and the picture tone is a calm tone such as a cool tone.
  • the comprehension difficulty can be determined according to the comparison between the user's educational background and the educational degree category. If the educational background belongs to elementary school students, the comprehension difficulty is low.
  • the attribute determination rule may include combining multiple user feature information and comparing with preset reference information to determine a single target attribute.
• weights may be assigned to multiple pieces of user feature information and combined in a weighted manner. For example, the three features of user age, user education, and user emotion can be combined with weights of 0.3, 0.4, and 0.3, respectively. If the user is between 18 and 50 years old, the comprehension difficulty score corresponding to age is 0.5 (or moderate). If the education is high school, the comprehension difficulty score corresponding to education is 0.3 (or low). If the user's emotion is irritable or impatient, the comprehension difficulty score corresponding to emotion is 0.3 (or low).
• A numerical value (e.g., the total comprehension difficulty score) or a category may be used to reflect the target attribute. For example, if the total score is low, the level of detail may be low. A minimal sketch of this weighted rule follows.
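• A minimal sketch of this weighted rule, using the illustrative scores above (the category cutoff is an assumption for illustration):

```python
WEIGHTS = {"age": 0.3, "education": 0.4, "emotion": 0.3}


def difficulty_scores(user):
    """Per-feature comprehension difficulty scores from the rules above."""
    return {
        "age": 0.5 if 18 <= user["age"] <= 50 else 0.3,
        "education": 0.3 if user["education"] == "high school" else 0.5,
        "emotion": 0.3 if user["emotion"] in ("irritable", "impatient") else 0.5,
    }


def comprehension_difficulty(user):
    s = difficulty_scores(user)
    total = sum(WEIGHTS[k] * s[k] for k in WEIGHTS)  # weighted combination
    return "simple" if total < 0.4 else "moderate"   # assumed cutoff


user = {"age": 30, "education": "high school", "emotion": "impatient"}
print(comprehension_difficulty(user))  # total = 0.36 -> "simple"
```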
  • the processing device 112 may also determine the target attribute based on the machine learning model. Specifically, the processing device 112 may acquire at least one trained target attribute determination model; input at least a part of the dialogue feature information and the user feature information into the at least one trained target attribute determination model; and based on the At least one trained target attribute determines the output of the model, and the at least one target attribute is determined.
• For the related content of using the machine learning model to determine the target attribute, reference may be made to FIG. 6 and its description, which will not be repeated here.
  • the processing device 112 can determine whether the target audio and video can be determined based on at least one target attribute.
  • the processing device 112 can automatically determine the target audio and video based on at least one target attribute.
  • the user can be provided with personalized target audio and video content, which can provide the user with a better viewing experience, help the user to answer questions, and/or improve the user's human-computer interaction experience.
  • the processing device 112 may retrieve the database from the storage device 140 .
  • the database may include a plurality of materials, such as at least one of one or more candidate audios, one or more candidate videos, one or more candidate texts, and one or more candidate images.
  • Each material in the database has at least one basic attribute, such as semantic information, level of detail, difficulty of understanding, emotional attributes, etc.
  • the processing device 112 may select a target audio/video matching the target attribute from the database based on at least one target attribute.
• For example, if the dialogue information sent by the user is "How to use this system?", the dialogue feature information extracted by the processor may be "system" and "how to use"; the processor 112 may obtain the user's preference settings through the network 120, where the user's preference for the audio and video playback content is set to detailed; the target audio and video determined by the processor 112 may then be a detailed version of a video on how to use the system.
  • the processor 112 may also generate the target audio and video based on the above-mentioned one or more non-audio and video materials matching the target attributes. For example, the processor 112 may generate speech based on the text and synthesize the text, images and speech into the target video. To determine the content of the target audio and video according to the database, reference may be made to FIG. 7 and its description, which will not be repeated here.
  • the processing device 112 may determine the target audio and video based on at least one target attribute according to the machine learning model. For the content of determining the target audio and video according to the machine learning model, reference may be made to FIG. 8 and its description, which will not be repeated here.
  • the target audio and video may be a single video.
  • the target audio and video may include multiple segments arranged in sequence.
  • the above-mentioned multiple segments can be played in sequence.
  • multiple segments may be segments of similar content.
  • the processor 112 may find a plurality of audios and videos matching the target attributes from the database, and sort the plurality of audios and videos according to certain rules to form a complete target audio and video. For example, the processor 112 may sort the above-mentioned multiple audios and videos according to the matching degree of the audios and videos with the target attribute.
  • the processor 112 may sort the plurality of audios and videos according to the value of a certain item of the target attribute, for example, according to the value of understanding difficulty in ascending order (that is, the understanding difficulty of the plurality of audios and videos increases).
  • the processing device 112 may directly generate the target audio and video including multiple segments according to a machine learning model (eg, the material determination model described in FIG. 8 ).
  • the processing device 112 may automatically generate material in a non-audio/video format.
  • the processing device 112 may directly determine that it is not necessary to determine the target audio and video, and thus execute step 907 .
  • the processor 112 may directly send the text or picture matching the target attribute to the terminal to conduct a dialogue with the user.
  • the terminal 130 may receive the automatically generated target audio-video or non-audio-video material through the network 120 .
  • the user may play the target audio and video through the terminal 130 (eg, click).
  • the terminal 130 can also be set to automatically play the target audio and video.
  • the terminal 130 may receive user feedback provided by the user.
• User feedback provided by the user may include the number of pauses, the pause duration, the number of playbacks, the playback duration, the number of fast-forwards, the fast-forward duration, the number of slow plays, the slow-play duration, whether a new question is asked, whether the playback is ended early, etc., or a combination thereof.
  • These user feedbacks may indicate the user's ability to understand and absorb the target audio and video, and/or the user's acceptance level (eg, like or dislike) the target audio and video.
  • the processing device 112 can obtain user feedback through the operation performed by the user on the terminal 130 on the target audio and video.
  • the terminal 130 can determine the pause operation, playback operation, fast-forward operation, slow-play operation, close operation, and operation of sending a new message performed by the user on the target audio and video on the terminal 130, and then transmit the operation to the processing device 112 through the network 120.
  • the terminal 130 may detect the user's operation of some buttons on the user interface and/or the user's gesture operation on the user interface.
• the terminal 130 can detect the pause, playback, fast-forward, slow-play, and close buttons clicked by the user on the target audio and video, and determine that the corresponding pause, playback, fast-forward, slow-play, or close operation has been performed. For example, if the number of fast-forwards of the target audio and video by the user is greater than a certain threshold (for example, 3 times), this may reflect that the content of the target audio and video is too simple and easy for the user to understand, that the user does not like the content, or that the user prefers a faster playback speed.
• the processing device 112 can automatically adjust the unplayed segments of the target audio and video. For example, in response to the user providing user feedback during the playback of the target audio and video, the processing device 112 may automatically adjust, based on the user feedback, the basic attributes of at least one unplayed segment among the one or more segments of the target audio and video. By determining whether the user provides user feedback, the processing device 112 can further optimize the content of the target audio and video to provide the user with a better viewing effect and improve the user experience.
  • the base attribute of the unplayed segment may correspond to the target attribute.
• The basic attributes may include only some of the target attributes (e.g., level of detail), or all of the target attributes.
  • the basic attributes may include semantic information, level of detail, difficulty of understanding, playback speed, playback sound, playback picture quality, picture tone, etc., or a combination thereof.
  • the processing device 112 may adjust basic properties of at least one unplayed segment in one or more segments of the target audio and video according to the user feedback.
  • the adjusted base attributes may be determined through attribute adjustment rules. Attribute adjustment rules may be similar to attribute determination rules.
• the attribute adjustment rules may include determining whether a single type of feedback in the user feedback is greater than a threshold (e.g., 3). If so, the playback speed of at least one unplayed segment of the target audio and video may be adjusted, for example, to 1.5 times the original speed.
  • the terminal 130 can receive the adjusted target audio and video through the network 120.
  • the adjusted target audio and video may cover the original target audio and video.
  • the adjusted target audio and video may be newly generated target audio and video.
  • the terminal 130 may also give a prompt of the adjusted target audio and video. For example, display the basic properties of the adjusted target audio and video (for example, display "The playback speed has been adjusted to 1.5 times").
• the user can also provide user feedback on the adjusted target audio and video through the terminal 130, and the terminal 130 can continue to receive the user feedback, so as to perform steps 9010-9011 again until the user no longer provides user feedback.
• the processor 112 can determine whether the user needs to send additional dialogue information. For example, the terminal 130 may display a dialog box on its interface asking the user whether other dialogue information needs to be sent, for example, "Do you have any other questions?". If so, step 901 is executed again, and the user may further input dialogue information. If not, the interaction process ends.
  • the interface can display closing remarks, for example, "This conversation is over, thank you".
  • interaction process 900 is for illustrative purposes and is not intended to limit the scope of protection of the present application. Numerous variations and modifications will occur to those skilled in the art under the guidance of this application. However, these variations and modifications do not depart from the scope of protection of the present application.
• All or part of the software may sometimes communicate over a network, such as the Internet or another communication network.
  • Such communications enable the loading of software from one computer device or processor to another.
• For example, software may be loaded from a management server or host computer onto the hardware platform of a computing environment, onto another computing environment implementing the system, or onto a system with similar functions of providing the information needed to determine the target audio and video. Therefore, another medium capable of transmitting software elements, such as light waves, radio waves, or electromagnetic waves propagating through cables, optical cables, or air, can also be used as a physical connection between local devices.
• The physical medium used for the carrier wave, such as a cable, a wireless connection, or a fiber-optic cable, can also be considered the medium carrying the software.
  • tangible "storage” media other terms referring to computer or machine "readable media” refer to media that participate in the execution of any instructions by a processor.
  • the computer program coding required for the operation of the various parts of this application may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python Etc., conventional procedural programming languages such as C language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • the program code may run entirely on the user's computer, or as a stand-alone software package on the user's computer, or partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or wide area network (WAN), or to an external computer (eg, through the Internet), or in a cloud computing environment, or as a Service usage such as Software as a Service (SaaS).
  • LAN local area network
  • WAN wide area network
  • SaaS Software as a Service

Abstract

Embodiments of the present application disclose a method for determining target audio and video. The method is implemented on a computing device having at least one processor and at least one storage device, and includes the following steps: obtaining dialog information related to a user; determining dialog feature information of the dialog information; obtaining user feature information of the user; determining, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and determining the target audio and video based on the at least one target attribute.

Description

Method and System for Determining Target Audio and Video

Technical Field

The present application relates to the field of audio and video processing, and in particular to a method and system for determining target audio and video.

Background

Human-machine dialog technology has entered everyday life and can be applied to intelligent customer service robots, smart speakers, chatbots, smart home devices, and the like. User needs can be met through human-machine voice dialog; for example, a user's questions can be answered by an intelligent customer service robot. At present, however, the dialog between an intelligent customer service robot and a user takes the form of plain text, or of instant messages combining voice, text, and pictures. Such feedback demands considerable comprehension and learning ability from the user, who often cannot solve the problem directly from the feedback. It is therefore desirable to provide a method and system for intelligently generating audio and video, so that a user's questions can be answered, or a dialog conducted with the user, more intuitively through audio and video.
Summary

One aspect of the present application provides a method for determining target audio and video. The method may be implemented on a computing device having at least one processor and at least one storage device. The method may include the following steps: obtaining dialog information related to a user; determining dialog feature information of the dialog information; obtaining user feature information of the user; determining, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and determining the target audio and video based on the at least one target attribute.

In some embodiments, determining the target audio and video based on the at least one target attribute may include: obtaining a database including a plurality of materials, the plurality of materials including at least one of: one or more candidate audios, one or more candidate videos, one or more candidate texts, or one or more candidate images; and determining the target audio and video based on the at least one target attribute and the database.

In some embodiments, determining the target audio and video based on the at least one target attribute may further include: for each of the at least one target attribute, computing a match degree between each of the plurality of materials in the database and the target attribute; for each of the plurality of materials, determining a total match score based on the material's at least one match degree corresponding to the at least one target attribute; selecting, from the plurality of materials and based on the match scores corresponding to the plurality of materials in the database, one or more target materials; and determining the target audio and video based on the one or more target materials.

In some embodiments, determining the target audio and video based on the at least one target attribute may further include: generating the target audio and video by adjusting basic attributes of the one or more target materials based on the one or more target materials and the at least one target attribute.

In some embodiments, the at least one target attribute of the target audio and video may include one or more of a content attribute, a level of detail, a difficulty of understanding, a playback speed, a picture tone, or a timbre.

In some embodiments, determining, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video may include: obtaining at least one trained target attribute determination model; inputting at least a portion of the dialog feature information and the user feature information into the at least one trained target attribute determination model; and determining the at least one target attribute based on an output of the at least one trained target attribute determination model.

In some embodiments, determining the target audio and video based on the at least one target attribute may include: obtaining a trained material determination model; inputting the dialog feature information into the material determination model; determining an initial audio and video based on an output of the material determination model; and generating the target audio and video by adjusting basic attributes of the initial audio and video based on the at least one target attribute.

In some embodiments, the target audio and video may include one or more segments.

In some embodiments, the method may further include: determining whether the user provides user feedback while the target audio and video is being played; and in response to the user providing user feedback during playback, determining, based on the user feedback, whether basic attributes of at least one unplayed segment among the one or more segments of the target audio and video need to be adjusted.

In some embodiments, the user feedback provided by the user during playback may include one or more of: a pause count, a pause duration, a replay count, a replay duration, a fast-forward count, a fast-forward duration, a slow-play count, a slow-play duration, whether a new question is raised, and whether playback is ended early.

In some embodiments, the user feature information may include user personal information, which may include one or more of: age, gender, educational background, work background, and health status.

In some embodiments, the user feature information may include user preference information, which may include at least one of the user's preference settings, the user's current emotion, or historical user feedback provided by the user on historical audio and video.
Another aspect of the present application provides a system for determining target audio and video. The system may include: at least one memory storing computer instructions; and at least one processor in communication with the memory, wherein when the at least one processor executes the computer instructions, the at least one processor causes the system to: obtain dialog information related to a user; determine dialog feature information of the dialog information; obtain user feature information of the user; determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and determine the target audio and video based on the at least one target attribute.

In some embodiments, to determine the target audio and video based on the at least one target attribute, the at least one processor may cause the system to further: obtain a database including a plurality of materials, the plurality of materials including at least one of: one or more candidate audios, one or more candidate videos, one or more candidate texts, or one or more candidate images; and determine the target audio and video based on the at least one target attribute and the database.

In some embodiments, to determine the target audio and video based on the at least one target attribute, the at least one processor may cause the system to further: for each of the at least one target attribute, compute a match degree between each of the plurality of materials in the database and the target attribute; for each of the plurality of materials, determine a total match score based on the material's at least one match degree corresponding to the at least one target attribute; select, from the plurality of materials and based on the match scores corresponding to the plurality of materials in the database, one or more target materials; and determine the target audio and video based on the one or more target materials.

In some embodiments, to determine the target audio and video based on the at least one target attribute, the at least one processor may cause the system to further: generate the target audio and video by adjusting basic attributes of the one or more target materials based on the one or more target materials and the at least one target attribute.

In some embodiments, the at least one target attribute of the target audio and video may include one or more of a content attribute, a level of detail, a difficulty of understanding, a playback speed, a picture tone, or a timbre.

In some embodiments, to determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video, the at least one processor may cause the system to further: obtain at least one trained target attribute determination model; input at least a portion of the dialog feature information and the user feature information into the at least one trained target attribute determination model; and determine the at least one target attribute based on an output of the at least one trained target attribute determination model.

In some embodiments, to determine the target audio and video based on the at least one target attribute, the at least one processor may cause the system to further: obtain a trained material determination model; input the dialog feature information into the material determination model; determine an initial audio and video based on an output of the material determination model; and generate the target audio and video by adjusting basic attributes of the initial audio and video based on the at least one target attribute.

In some embodiments, the target audio and video may include one or more segments.

In some embodiments, the at least one processor may cause the system to further: determine whether the user provides user feedback while the target audio and video is being played; and in response to the user providing user feedback during playback, determine, based on the user feedback, whether basic attributes of at least one unplayed segment among the one or more segments of the target audio and video need to be adjusted.

In some embodiments, the user feedback provided by the user during playback may include one or more of: a pause count, a pause duration, a replay count, a replay duration, a fast-forward count, a fast-forward duration, a slow-play count, a slow-play duration, whether a new question is raised, and whether playback is ended early.

In some embodiments, the user feature information may include user personal information, which may include one or more of: age, gender, educational background, work background, and health status.

In some embodiments, the user feature information may include user preference information, which may include at least one of the user's preference settings, the user's current emotion, or historical user feedback provided by the user on historical audio and video.

Another aspect of the present application provides an apparatus for determining target audio and video. The apparatus may include: an obtaining module configured to obtain dialog information related to a user; a determination module configured to determine dialog feature information of the dialog information; an obtaining module configured to obtain user feature information of the user; a determination module configured to determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and a determination module configured to determine the target audio and video based on the at least one target attribute.

Another aspect of the present application provides a computer-readable storage medium storing computer instructions. When a computer reads the computer instructions in the storage medium, the computer executes a method. The method may include: obtaining dialog information related to a user; determining dialog feature information of the dialog information; obtaining user feature information of the user; determining, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and determining the target audio and video based on the at least one target attribute.
Brief Description of the Drawings

The present application is further described by way of exemplary embodiments, which are described in detail with reference to the accompanying drawings. These embodiments are not limiting; in these embodiments, the same reference numerals denote the same structures, wherein:

FIG. 1 is a schematic diagram of a system for determining target audio and video according to some embodiments of the present application;

FIG. 2 is a schematic diagram of exemplary hardware and/or software components of an exemplary computing device according to some embodiments of the present application;

FIG. 3 is a schematic diagram of an exemplary terminal device according to some embodiments of the present application;

FIG. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application;

FIG. 5 is an exemplary flowchart for determining target audio and video according to some embodiments of the present application;

FIG. 6 is a flowchart of determining at least one target attribute corresponding to the target audio and video according to some embodiments of the present application;

FIG. 7 is a flowchart of determining the target audio and video based on a database according to some embodiments of the present application;

FIG. 8 is another flowchart of determining the target audio and video according to some embodiments of the present application; and

FIG. 9 is a schematic diagram of interaction between a terminal and a server according to some embodiments of the present application.
Detailed Description

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some examples or embodiments of the present application; those of ordinary skill in the art may, without creative effort, apply the present application to other similar scenarios according to these drawings. Unless obvious from the context or otherwise stated, the same reference numerals in the figures denote the same structures or operations.

It should be understood that the terms "system", "device", "unit", and/or "module" as used herein are one way of distinguishing different components, elements, parts, sections, or assemblies at different levels. These words may be replaced by other expressions that achieve the same purpose.

As used in the present application and the claims, unless the context clearly suggests otherwise, the words "a", "an", "one", and/or "the" do not specifically refer to the singular and may also include the plural. In general, the terms "comprise" and "include" merely indicate the inclusion of explicitly identified steps and elements; these steps and elements do not constitute an exclusive list, and a method or device may also include other steps or elements.

Flowcharts are used in the present application to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed exactly in order. Rather, the steps may be processed in reverse order or simultaneously; other operations may also be added to these processes, or one or more operations may be removed from them.

The present application discloses a system and method for determining target audio and video. The method may include: obtaining dialog information related to a user; determining dialog feature information of the dialog information; obtaining user feature information of the user; determining, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and determining the target audio and video based on the at least one target attribute. As used herein, the term "audio and video", also written as A/V, denotes audio or video. Through the user feature information and the target attributes, target audio and video can be tailored to the user's needs, giving the user a better experience. The method may further include determining, based on user feedback, whether basic attributes of at least one unplayed segment among one or more segments of the target audio and video need to be adjusted. Based on the user feedback provided while the user watches the target audio and video, basic attributes of the target audio and video (e.g., the playback speed) can be adjusted, providing a better viewing experience.
FIG. 1 is a schematic diagram of an exemplary system for determining target audio and video according to some embodiments of the present application. In some embodiments, the system 100 for determining target audio and video (or simply system 100) may be a system for human-machine dialog. For example, system 100 may be applied to various intelligent dialog devices, including but not limited to intelligent customer service robots, smart speakers, chatbots, smart home devices (e.g., smart TVs, smart air conditioners, smart sweeping/mopping devices), smart vehicles, and the like. System 100 may also provide interactive services through a web page or an app on a terminal; for example, the server 110 in system 100 may answer a user's questions through an intelligent customer service function in the app. The present application is not limited in this respect. In some embodiments, system 100 may include a server 110, a network 120, a terminal 130, and a storage device 140.

In some embodiments, server 110 may be a single server or a server group. The server group may be centralized or distributed (e.g., server 110 may be a distributed system). In some embodiments, server 110 may be local or remote. For example, server 110 may access information and/or data stored in terminal 130 and/or storage device 140 via network 120. As another example, server 110 may be directly connected to terminal 130 and/or storage device 140 to access the stored information and/or data. In some embodiments, server 110 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, server 110 may be implemented on a computing device 200 including one or more components shown in FIG. 2.

In some embodiments, server 110 may include a processing device 112. Processing device 112 may process information and/or data related to determining target audio and video to perform one or more functions described in the present application. For example, processing device 112 may determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video. Processing device 112 may also determine the target audio and video based on the at least one target attribute. In some embodiments, processing device 112 may include one or more processing engines (e.g., single-core or multi-core processing engines). Processing device 112 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, or the like, or any combination thereof. In some embodiments, processing device 112 may be integrated into terminal 130.

The network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components of system 100 (e.g., server 110, terminal 130, or storage device 140) may send information and/or data to other components of system 100 via network 120. For example, server 110 may obtain dialog information related to the user from terminal 130 via network 120. In some embodiments, network 120 may be a wired network, a wireless network, or the like, or any combination thereof. Merely by way of example, network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof. In some embodiments, network 120 may include one or more network access points. For example, network 120 may include wired or wireless network access points, such as base stations and/or Internet exchange points 120-1, 120-2, .... Through an access point, one or more components of system 100 may connect to network 120 to exchange data and/or information.

In some embodiments, the user may be an individual using terminal 130. The user may use terminal 130 to conduct a dialog, watch the target audio and video, and so on. In some embodiments, terminal 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, an in-vehicle device 130-4, or the like, or any combination thereof. In some embodiments, mobile device 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, an intelligent customer service robot, a chatbot, a smart vehicle, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a smart appliance control device, a smart monitoring device, a smart TV, a smart camera, an intercom, a smart speaker, a smart sweeping/mopping device, or the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, smart footwear, smart glasses, a smart helmet, a smart watch, smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a personal digital assistant (PDA), a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality eyeshade, an augmented reality helmet, augmented reality glasses, an augmented reality eyeshade, or the like, or any combination thereof. For example, the virtual reality device and/or augmented reality device may include Google Glass™, Oculus Rift™, HoloLens™, Gear VR™, or the like. In some embodiments, in-vehicle device 130-4 may include an in-vehicle computer, an in-vehicle TV, or the like.

The storage device 140 may store data and/or instructions related to determining target audio and video. In some embodiments, storage device 140 may store data obtained from terminal 130. In some embodiments, storage device 140 may store data and/or instructions that server 110 may execute or use to perform the exemplary methods described in the present application. In some embodiments, storage device 140 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), or the like, or any combination thereof. Exemplary mass storage may include magnetic disks, optical disks, solid-state disks, and the like. Exemplary removable storage may include flash drives, floppy disks, optical disks, memory cards, compact disks, magnetic tapes, and the like. Exemplary volatile read-write memory may include random access memory (RAM). Exemplary RAM may include dynamic random access memory (DRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), static random access memory (SRAM), thyristor random access memory (T-RAM), zero-capacitance random access memory (Z-RAM), and the like. Exemplary ROM may include mask read-only memory (MROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), compact disk read-only memory (CD-ROM), digital versatile disk read-only memory, and the like. In some embodiments, storage device 140 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.

In some embodiments, storage device 140 may be connected to network 120 to communicate with one or more components of system 100 (e.g., server 110, terminal 130). One or more components of system 100 may access the data and/or instructions stored in storage device 140 via network 120. In some embodiments, storage device 140 may be directly connected to, or communicate with, one or more components of system 100 (e.g., server 110, terminal 130). In some embodiments, storage device 140 may be part of server 110.

Those of ordinary skill in the art will understand that when an element (or component) of system 100 executes, the element may execute through electrical and/or electromagnetic signals. For example, when terminal 130 sends the user's operation on the target audio and video to server 110, the processor of terminal 130 may generate an electrical signal encoding the request. The processor of terminal 130 may then send the electrical signal to an output port. If terminal 130 communicates with server 110 via a wired network, the output port may be physically connected to a cable, which further transmits the electrical signal to an input port of server 110. If terminal 130 communicates with server 110 via a wireless network, the output port of terminal 130 may be one or more antennas that convert the electrical signal into an electromagnetic signal. Within an electronic device such as terminal 130 and/or server 110, when its processor processes an instruction, issues an instruction, and/or performs an action, the instruction and/or action is carried out by electrical signals. For example, when the processor retrieves or saves data from a storage medium (e.g., storage device 140), it may send an electrical signal to a read/write device of the storage medium, which may read or write structured data in the storage medium. The structured data may be transmitted to the processor in the form of electrical signals over a bus of the electronic device. Here, an electrical signal may refer to one electrical signal, a series of electrical signals, and/or at least two discontinuous electrical signals.
FIG. 2 is a schematic diagram of exemplary hardware and/or software components of an exemplary computing device according to some embodiments of the present application. In some embodiments, server 110 and terminal 130 may be implemented on a computing device 200. For example, processing device 112 may be implemented on computing device 200 and configured to perform the functions of processing device 112 disclosed in the present application.

Computing device 200 may be used to implement any component of system 100 as described herein. For example, processing device 112 may be implemented on computing device 200 by its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience the computer functions related to the human-machine dialog technology described herein may be implemented in a distributed fashion on a number of similar platforms to distribute the processing load.

Computing device 200 may include a communication port 250 connected to a network to facilitate data communication. Computing device 200 may also include a processor 220 that executes program instructions in the form of one or more logic circuits. For example, processor 220 may include interface circuits and processing circuits. The interface circuits may be configured to receive electrical signals from a bus 210, the electrical signals encoding structured data and/or instructions for the processing circuits. The processing circuits may perform logical operations and then encode conclusions, results, and/or instructions as electrical signals. The interface circuits may then send out the electrical signals from the processing circuits via bus 210.

Computing device 200 may also include different forms of program storage and data storage, e.g., a magnetic disk 270, a read-only memory (ROM) 230, or a random access memory (RAM) 240, for storing various data files processed and/or transmitted by computing device 200. Computing device 200 may also include program instructions stored in ROM 230, RAM 240, and/or other types of non-transitory storage media to be executed by processor 220. The methods and/or processes of the present application may be implemented as program instructions. Computing device 200 also includes input/output components 260 to support input/output between the computer and other components. Computing device 200 may also receive programming and data via network communication.

For convenience of illustration, only one processor is described in FIG. 2. At least two processors may also be included; thus, operations and/or method steps described in the present application as being performed by one processor may also be performed jointly or separately by multiple processors. For example, if the processor of computing device 200 performs operation A and operation B in the present application, operation A and operation B may also be performed jointly or separately by two different CPUs and/or processors of computing device 200 (e.g., a first processor performs operation A and a second processor performs operation B, or the first and second processors jointly perform operations A and B).

FIG. 3 is a schematic diagram of exemplary hardware and/or software components of an exemplary mobile device according to some embodiments of the present application. In some embodiments, terminal 130 may be implemented on a mobile device 300. As shown in FIG. 3, mobile device 300 may include a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, input/output (I/O) 350, a memory 360, a mobile operating system (OS) 370, and a storage 390. In some embodiments, any other suitable components, including but not limited to a system bus or a controller (not shown), may also be included in mobile device 300.

In some embodiments, the mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™) and one or more applications 380 may be loaded from storage 390 into memory 360 for execution by CPU 340. Applications 380 may include a browser or any other suitable mobile application for receiving and presenting information related to determining target audio and video, or other information from system 100. Interaction of the user with the information stream may be achieved through I/O 350 and provided to processing device 112 and/or other components of system 100 via network 120.

To implement the various modules, units, and functions described in the present application, a computer hardware platform may be used as the hardware platform of one or more of the components described herein. A computer with user interface components may be used to implement a personal computer (PC) or any other type of workstation or terminal device. If properly programmed, a computer may also act as a server.
FIG. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application. The processing device may be the exemplary processing device 112 described in FIG. 1. In some embodiments, processing device 112 may determine the target audio and video based on at least one target attribute. In some embodiments, processing device 112 may be implemented on a processing unit (e.g., the processor 220 shown in FIG. 2 or the CPU 340 shown in FIG. 3). Merely by way of example, processing device 112 may be implemented on the CPU 340 of a terminal device. As shown in FIG. 4, processing device 112 may include an obtaining module 410, a determination module 420, and a training module 430.

The obtaining module 410 may obtain information related to system 100. For example, the obtaining module may obtain the dialog information sent by the user and the user feature information of the user. For descriptions of the dialog information sent by the user and the user feature information, reference may be made to the relevant descriptions elsewhere in the present application (e.g., FIG. 5 and its description), which are not repeated here.

The determination module 420 may determine the dialog feature information of the dialog information. The determination module 420 may determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video, and may also determine the target audio and video based on the at least one target attribute. The determination module 420 may further determine whether the user provides user feedback while the target audio and video is being played, and, in response to the user providing user feedback during playback, determine, based on the user feedback, whether basic attributes of at least one unplayed segment among the one or more segments of the target audio and video need to be adjusted. For descriptions of determining the dialog feature information, the at least one target attribute, the target audio and video, the user feedback, and whether the basic attributes of at least one unplayed segment need to be adjusted, reference may be made to the relevant descriptions elsewhere in the present application (e.g., FIGS. 5-8 and their descriptions), which are not repeated here.

The training module 430 may be used to train the target attribute determination model and the material determination model. For descriptions of the training of the target attribute determination model and the material determination model, reference may be made to the relevant descriptions elsewhere in the present application (e.g., FIGS. 5 and 8 and their descriptions), which are not repeated here.

It should be understood that the system and its modules shown in FIG. 4 may be implemented in various ways. For example, in some embodiments, the system and its modules may be implemented by hardware, software, or a combination of software and hardware. The hardware part may be implemented using dedicated logic; the software part may be stored in a memory and executed by an appropriate instruction execution system, e.g., a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the above methods and systems may be implemented using computer-executable instructions and/or processor control code, for example provided on a carrier medium such as a magnetic disk, CD, or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules of the present specification may be implemented not only by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices, but also by software executed by various types of processors, or by a combination of the above hardware circuits and software (e.g., firmware).

It should be noted that the above description of processing device 112 and its modules is for convenience of description only and does not limit the present specification to the scope of the illustrated embodiments. It will be understood that, after learning the principle of the system, those skilled in the art may arbitrarily combine the modules, or form subsystems connected with other modules, without departing from this principle. For example, in some embodiments, the obtaining module and the determination module disclosed in FIG. 4 may be different modules in one system, or one module may implement the functions of two or more of the above modules. As another example, the determination module may be subdivided into a dialog feature information determination unit, a target attribute determination unit, and a target audio and video determination unit, respectively used to determine the dialog feature information, the at least one target attribute, and the target audio and video. As yet another example, the modules in the processing device may share one storage module, or each module may have its own storage module. All such variations fall within the scope of protection of the present specification. In some embodiments, the training module in FIG. 4 may be omitted; the training of one or more machine learning models may be completed by an external processing device.
FIG. 5 is an exemplary flowchart for determining target audio and video according to some embodiments of the present application. In some embodiments, process 500 may be implemented by a set of instructions (e.g., an application) stored in a storage device (e.g., storage device 140, the ROM 230 or RAM 240 of computing device 200, or the storage 390 or memory 360 of mobile device 300). For example, processor 220 and/or the modules in FIG. 4 may execute the set of instructions and, when executing the instructions, may be configured to perform process 500. The operations of the process shown below are for illustration purposes only. In some embodiments, process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed herein. Additionally, the order of the operations shown in FIG. 5 and described below is not limiting.

In 510, processing device 112 (e.g., the obtaining module 410) may obtain dialog information related to the user.

As used herein, "dialog information related to the user" refers to dialog information sent by the user through a terminal and/or dialog information received by the user through the terminal. For example, the dialog information sent by the user includes, but is not limited to, voice, text, pictures, and the like. In some embodiments, the user may send dialog information through a user interface on a terminal (e.g., the terminal 130 shown in FIG. 1) to interact with a computing device (e.g., computing device 200, server 110). For example, the user may converse with an intelligent customer service robot, a chatbot, or the like. In some embodiments, processing device 112 may obtain the dialog information sent by the user from the terminal via network 120, for example, in real time.

In some embodiments, processing device 112 may obtain the dialog information most recently sent by the user and/or contextual dialog information. In some embodiments, the most recently sent dialog information may include one or more characters or words, a sentence, a passage, one or more voice messages, one or more pictures, and the like. The most recently sent dialog information may include a declarative sentence (e.g., "Hello") or an interrogative sentence (e.g., "How do I use the system?"). Merely by way of example, the user may ask a question through the user interface on the terminal, and processing device 112 may answer the user's question by generating target audio and video. The contextual dialog information may include the consecutive messages sent and received by the user through the terminal before the latest message was sent. For example, after the user sends a message through the terminal, processing device 112 may analyze the message and send feedback information (e.g., text, voice, pictures, audio and video) to the terminal. Based on the feedback information, the user may continue to send a new message. In this case, the new message is the dialog information most recently sent by the user, and the contextual dialog information includes the message previously sent by the user and the above feedback information (also referred to as two consecutive rounds of dialog information). In some embodiments, the contextual dialog information may include the most recent several rounds of dialog information, or all of the dialog information. The number of rounds may be preset in system 100.

In 520, processing device 112 (e.g., the determination module 420) may determine dialog feature information of the dialog information.

The dialog feature information may include keywords, emotions, and the like in the dialog information. In some embodiments, processing device 112 may determine the dialog feature information according to the dialog information; for example, according to the most recently sent dialog information. If the dialog information is text, processing device 112 may extract keywords from the dialog information using a keyword extraction technique. Exemplary keyword extraction techniques may include, but are not limited to, topic models, TF-IDF, TextRank, RAKE, and the like. If the dialog information is speech, processing device 112 may convert the speech into text through a speech recognition technique, and then extract the dialog feature information from the converted text.
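To make the keyword-extraction step above concrete, the following is a minimal sketch of the TF-IDF variant using scikit-learn. It is not the disclosed implementation: the background corpus, the English stand-in messages, and the top_k cutoff are all illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(latest_message, history, top_k=3):
    """Rank terms in the latest message by TF-IDF weight.

    `history` supplies background documents so that common words
    receive low inverse-document-frequency scores.
    """
    corpus = history + [latest_message]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(corpus)       # one row per document
    terms = vectorizer.get_feature_names_out()
    weights = matrix.toarray()[-1]                  # last row: latest message
    ranked = sorted(zip(terms, weights), key=lambda t: -t[1])
    return [term for term, weight in ranked[:top_k] if weight > 0]

# Hypothetical usage with stand-in dialog messages.
history = ["hello there", "thanks for the help yesterday"]
print(extract_keywords("how do I use the system", history))
```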
If the dialog information is a picture, processing device 112 may identify dialog feature information in the picture through an image recognition technique. For example, processing device 112 may recognize text in the picture and identify keywords from the text content. As another example, processing device 112 may recognize features in the picture to determine the emotion the picture expresses. For example, if the picture sent by the user is an angry expression, processing device 112 may extract image features from the picture and thereby determine the emotional content.

In some cases, the user's question or intended meaning can be understood more accurately from the contextual dialog information. In some embodiments, processing device 112 may determine the dialog feature information according to the contextual dialog information. By analyzing the contextual dialog information and performing semantic recognition, the dialog feature information can be determined. Specifically, processing device 112 may determine the dialog feature information through a hierarchical model or a non-hierarchical model.
In 530, processing device 112 (e.g., the obtaining module 410) may obtain user feature information of the user.

The user feature information may include user personal information, user preference information, other information about the user (e.g., the user's hobbies), and the like.

The user personal information may include age, gender, educational background, work background, health status, home address, marital status, and the like, or a combination thereof. For example, the user may input personal information through the terminal by voice input, text input, and the like, and processing device 112 may obtain the user personal information from the terminal via network 120. As another example, an app may be installed on the terminal, and the user needs to log in to the app to interact with it through the terminal. Processing device 112 may obtain the user's user ID through the terminal and obtain, via network 120, the user personal information corresponding to the user ID from storage device 140.

The user preference information may include the user's preference settings, the user's current emotion, or historical user feedback the user has previously provided on historical audio and video. The preference settings may include the user's preference for the playback speed of the target audio and video (e.g., slow, normal, fast), for the playback voice (e.g., a female voice, a male voice), for the playback content (e.g., concise, detailed), for the playback picture quality (e.g., Blu-ray, high definition, standard definition), and so on. In some embodiments, the user may provide preference settings on the terminal. For example, terminal 130 may provide a preference-setting selection page for the user (e.g., through the above app). In some embodiments, the user's preference settings may be stored in storage device 140. Processor 112 may, based on the user ID with which the user logs in to the app, obtain the user preference information corresponding to the user ID from storage device 140 via network 120.
The user's current emotion may include happiness, liking, sadness, surprise, anger, fear, and disgust. Alternatively, the user's emotion may be classified as positive, negative, neutral, and so on. Processing device 112 may identify the user's current emotion according to the dialog information. For example, when the dialog information related to the user is text, processing device 112 may identify the user's current emotion through a text sentiment analysis technique. Exemplary text sentiment analysis techniques may include, but are not limited to, techniques based on keyword extraction rules, techniques based on machine learning models (e.g., support vector machines, neural networks, logistic regression), and the like, or a combination thereof. For example, processing device 112 may obtain an emotion keyword list (e.g., positive words, negative words, or words expressing anger, happiness, sadness, etc.), extract emotion keywords from the text, and compare them with the emotion keyword list to determine the emotion the text expresses.
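As a concrete illustration of the keyword-list comparison just described, the sketch below classifies a message against a small emotion lexicon. The lexicon entries, the English stand-in words, and the tie-breaking rule are assumptions made for the example only.

```python
EMOTION_LEXICON = {
    "anger":   {"furious", "annoyed", "unacceptable"},
    "joy":     {"great", "thanks", "perfect"},
    "sadness": {"unfortunately", "disappointed", "sorry"},
}

def detect_emotion(text):
    """Count lexicon hits per emotion; fall back to 'neutral'."""
    tokens = set(text.lower().split())
    scores = {emotion: len(tokens & words)
              for emotion, words in EMOTION_LEXICON.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"

print(detect_emotion("this is unacceptable, I am furious"))  # -> anger
```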
The historical user feedback the user has provided on historical audio and video may include feedback on the content of the audio and video, on the playback voice, on the playback speed, on the video picture quality, on the playback process, and the like, or any combination thereof. In some embodiments, processing device 112 may infer the historical user feedback from the user's operations on historical audio and video. For example, processing device 112 may determine the user's feedback on the playback content from pause, replay, and fast-forward operations performed on historical audio and video. As another example, processing device 112 may determine the feedback on the playback speed from the user's speed-adjustment operations on historical audio and video. As yet another example, processing device 112 may determine the feedback on the video picture quality from the user's picture-quality adjustment operations on historical videos.
In 540, processing device 112 (e.g., the determination module 420) may determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video.

In some embodiments, the target attributes may include, but are not limited to, semantic information, a level of detail, a difficulty of understanding, a playback speed, a timbre of the playback voice, a playback picture quality, a picture tone, and the like, or a combination thereof. In some embodiments, the target attributes include at least semantic information. Semantic information refers to the meaning to be expressed by the feedback information that processing device 112 determines according to the user's dialog information (e.g., a question). In some embodiments, processing device 112 may determine the semantic information according to the extracted dialog feature information. In some embodiments, the semantic information may be expressed in the form of keywords. For example, if the keyword extracted from the user's latest dialog information is "Hello", the determined semantic information may be "greeting".

In some embodiments, the target attributes may be determined based on the user feature information through attribute determination rules. The attribute determination rules may be used to determine all of the target attributes or a single target attribute.

In some embodiments, an attribute determination rule may include comparing a single item of user feature information with preset reference information (e.g., a category or a threshold) to determine a target attribute. A single item of user feature information may determine one or more target attributes, and multiple items of user feature information may also jointly determine a single target attribute. For example, multiple target attributes may be determined by comparing the user's age with age thresholds. If the age is greater than a first age threshold (e.g., 60), the level of detail is set to detailed, the difficulty of understanding to easy, the playback speed to slow, the timbre to normal, and the picture tone to a calm tone such as a cool tone. As another example, if the user's age is less than a second age threshold (e.g., 10), the level of detail is detailed, the difficulty of understanding is easy, the playback speed is slow, the timbre is one children like (e.g., that of a cartoon character), and the picture tone is warm. As yet another example, the difficulty of understanding may be determined by comparing the user's educational background with educational categories; if the user is a primary school student, the difficulty of understanding is set relatively low.

In some embodiments, an attribute determination rule may include combining multiple items of user feature information and comparing the combination with preset reference information to determine a single target attribute. Specifically, weights may be assigned to the items of user feature information, and the items may be combined in a weighted manner. For example, the user's age, educational background, and emotion may be combined with weights of 0.3, 0.4, and 0.3, respectively. If the user's age is between 18 and 50, the difficulty-of-understanding score corresponding to age is 0.5 (or moderate); if the educational background is high school, the score corresponding to education is 0.3 (or low); if the emotion is irritable or impatient, the score corresponding to emotion is 0.3 (or low). The total difficulty-of-understanding score corresponding to these three user features may then be 0.3*0.5 + 0.4*0.3 + 0.3*0.3 = 0.36. In some embodiments, a numeric value (e.g., this total score) may directly reflect the target attribute. In some embodiments, a category may reflect the target attribute; for example, if the total score is low, the level of detail may be set to low.
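The weighted combination in the example above can be written as a small scoring function. The sketch below reproduces the worked example (0.3*0.5 + 0.4*0.3 + 0.3*0.3 = 0.36); the bucket boundaries and the scores for categories not mentioned in the text are illustrative assumptions.

```python
WEIGHTS = {"age": 0.3, "education": 0.4, "emotion": 0.3}

def age_score(age):
    # Ages 18-50 score "moderate" (0.5) as in the example; other buckets assumed.
    return 0.5 if 18 <= age <= 50 else 0.3

def education_score(level):
    # High school -> 0.3 per the example; the other levels are assumptions.
    return {"primary": 0.2, "high_school": 0.3, "university": 0.6}.get(level, 0.5)

def emotion_score(emotion):
    # Irritable or impatient users get a lower difficulty budget (0.3).
    return 0.3 if emotion in ("irritable", "impatient") else 0.5

def difficulty_score(age, education, emotion):
    parts = {"age": age_score(age),
             "education": education_score(education),
             "emotion": emotion_score(emotion)}
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)

print(round(difficulty_score(30, "high_school", "impatient"), 2))  # 0.36
```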
In some embodiments, processing device 112 may also determine the target attributes based on a machine learning model. Specifically, processing device 112 may obtain at least one trained target attribute determination model; input at least a portion of the dialog feature information and the user feature information into the at least one trained target attribute determination model; and determine the at least one target attribute based on an output of the at least one trained target attribute determination model. For details on determining target attributes using a machine learning model, reference may be made to FIG. 6 and its description, which are not repeated here.
In 550, processing device 112 (e.g., the determination module 420) may determine the target audio and video based on the at least one target attribute. Through the at least one target attribute, personalized target audio and video content can be provided to the user, giving the user a better viewing experience, better helping answer the user's questions, and/or improving the user's human-machine interaction experience.

In some embodiments, processing device 112 may obtain a database from storage device 140. The database may include a plurality of materials, e.g., at least one of one or more candidate audios, one or more candidate videos, one or more candidate texts, or one or more candidate images. Each material in the database has at least one basic attribute, e.g., semantic information, a level of detail, a difficulty of understanding, an emotion attribute. Processing device 112 may select, based on the at least one target attribute, a target audio and video matching the target attributes from the database. For example, suppose the dialog information sent by the user is "How do I use this system?", the dialog feature information extracted by the processor may be "system" and "how to use", and processor 112 obtains the user's preference settings via network 120, the user's preference for the playback content of the target audio and video being detailed; the target audio and video determined by processor 112 may then be a detailed video on how to use the system.

In some embodiments, when processing device 112 determines that the database contains no target audio and video matching the target attributes, it may obtain from the database one or more materials in non-audio/video form (e.g., text, pictures) that match the target attributes. Optionally, processor 112 may send the matching text or pictures directly to the terminal to converse with the user. Processor 112 may also generate the target audio and video based on the one or more matching non-audio/video materials. For example, processor 112 may generate speech from text, and synthesize the text, images, and speech into the target video. For determining the target audio and video from the database, reference may be made to FIG. 7 and its description, which are not repeated here.

In some embodiments, processing device 112 may determine the target audio and video based on the at least one target attribute using a machine learning model. For details, reference may be made to FIG. 8 and its description, which are not repeated here.

In some embodiments, the target audio and video may be a single video. In some embodiments, the target audio and video may include a plurality of segments arranged in order; when the terminal plays the target audio and video, the segments may be played in order. For example, the segments may be segments with similar content. Processor 112 may find, in the database, multiple audios and videos matching the target attributes and order them according to certain rules to form one complete target audio and video. For example, processor 112 may order them by their match degree with the target attributes. As another example, processor 112 may order them by the value of one of the target attributes, e.g., from low to high difficulty of understanding (i.e., the difficulty of the segments increases). In some embodiments, processing device 112 may directly generate a target audio and video containing multiple segments using a machine learning model (e.g., the material determination model described in FIG. 8).
In 560, processing device 112 (e.g., the determination module 420) may determine whether the user provides user feedback while the target audio and video is being played.

While the target audio and video is being played, the user feedback provided by the user may include a pause count, a pause duration, a replay count, a replay duration, a fast-forward count, a fast-forward duration, a slow-play count, a slow-play duration, whether a new question is raised, whether playback is ended early, and the like, or a combination thereof. Such feedback can indicate the user's ability to understand and absorb the target audio and video, and/or the user's degree of acceptance of it (e.g., like or dislike). In some embodiments, processing device 112 may determine whether the user provides feedback through the operations the user performs on the target audio and video at terminal 130. For example, terminal 130 may determine the pause, replay, fast-forward, slow-play, and close operations the user performs on the target audio and video, as well as operations of sending new messages, and transmit them to processing device 112 via network 120. Specifically, terminal 130 may detect the user's operations on buttons of the user interface and/or the user's gesture operations on the user interface; e.g., on detecting clicks on the pause, replay, fast-forward, slow-play, or close buttons of the target audio and video, it determines that the corresponding pause, replay, fast-forward, slow-play, or close operation has been performed. For example, if the user fast-forwards the target audio and video more than a certain threshold number of times (e.g., 3), this may indicate that the content is too simple and easy to understand for the user, that the user dislikes the content, or that the user prefers a faster playback speed.
In 570, processing device 112 (e.g., the determination module 420) may, in response to the user providing user feedback while the target audio and video is being played, determine, based on the user feedback, whether basic attributes of at least one unplayed segment among the one or more segments of the target audio and video need to be adjusted. By determining whether the user has provided feedback, processor 112 can further optimize the content of the target audio and video, providing the user with a better viewing effect and improving the user experience.

In some embodiments, the basic attributes of the unplayed segments may correspond to the target attributes. The basic attributes may include only some of the target attributes (e.g., the level of detail) or all of them, and may include semantic information, a level of detail, a difficulty of understanding, a playback speed, a playback voice, a playback picture quality, a picture tone, and the like, or a combination thereof. When user feedback is detected, processing device 112 may adjust, according to the feedback, the basic attributes of at least one unplayed segment among the one or more segments of the target audio and video. In some embodiments, the adjusted basic attributes may be determined through attribute adjustment rules, which may be similar to the attribute determination rules. In some embodiments, an attribute adjustment rule may test whether a single type of feedback exceeds a threshold. Merely by way of example, if the pause count is greater than a threshold (e.g., 3), at least one unplayed segment may be replaced by a more detailed segment; if the fast-forward duration is greater than a threshold (e.g., 3 minutes), the playback speed of at least one unplayed segment may be adjusted, e.g., to 1.5x. It should be noted that the present application does not limit the attribute determination rules or the attribute adjustment rules.
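A minimal sketch of the threshold-based adjustment rules just described is shown below. The thresholds (a pause count of 3, a fast-forward duration of 3 minutes) come from the examples in the text; the rule set and the returned adjustment format are assumptions for illustration.

```python
def plan_adjustments(feedback):
    """Map playback feedback onto basic-attribute adjustments
    for the not-yet-played segments."""
    adjustments = {}
    if feedback.get("pause_count", 0) > 3:
        # Frequent pausing suggests the material is too dense or too hard.
        adjustments["detail_level"] = "more_detailed"
    if feedback.get("fast_forward_seconds", 0) > 180:
        # Long fast-forwarding suggests the pace is too slow for the user.
        adjustments["playback_speed"] = 1.5
    return adjustments

print(plan_adjustments({"pause_count": 5, "fast_forward_seconds": 200}))
# -> {'detail_level': 'more_detailed', 'playback_speed': 1.5}
```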
It should be noted that the description of process 500 is for illustration purposes and is not intended to limit the scope of protection of the present application. Those skilled in the art may make multiple variations and modifications under the guidance of the present application; however, such variations and modifications do not depart from the scope of protection of the present application. For example, process 500 may further include storing the target audio and video in storage device 140. As another example, operations 560-570 in process 500 may be omitted.
FIG. 6 is a flowchart of determining at least one target attribute corresponding to the target audio and video according to some embodiments of the present application. In some embodiments, one or more steps of process 600 may be performed to obtain the at least one target attribute described in operation 540 of FIG. 5. In some embodiments, process 600 may be implemented by a set of instructions (e.g., an application) stored in a storage device (e.g., storage device 140, the ROM 230 or RAM 240 of computing device 200, or the storage 390 or memory 360 of mobile device 300). For example, processor 220 and/or the modules in FIG. 4 may execute the set of instructions and, when executing the instructions, may be configured to perform process 600. The operations of the process shown below are for illustration purposes only. In some embodiments, process 600 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed herein. Additionally, the order of the operations shown in FIG. 6 and described below is not limiting.

In 610, processing device 112 (e.g., the obtaining module 410) may obtain at least one trained target attribute determination model.

The target attribute determination model may be a model, e.g., a machine learning model, for determining at least one target attribute corresponding to the target audio and video. In some embodiments, the at least one target attribute determination model may include a deep learning model, e.g., a deep neural network (DNN) model, a convolutional neural network (CNN) model, a recurrent neural network (RNN) model, a feature pyramid network (FPN) model, a Seq2Seq model, a long short-term memory (LSTM) model, and the like. Merely by way of example, the target attribute determination model may receive a model input (e.g., dialog feature information, user feature information, and/or other information related to the user), and may output at least one item of target attribute information. In some embodiments, the target attribute determination model may output a sequence of audio and video attribute information. For example, the sequence may include multiple groups of attribute information arranged in order, each group corresponding to one segment of the target audio and video.

In some embodiments, processing device 112 (e.g., the training module 430) may obtain the at least one trained target attribute determination model from one or more components of system 100 (e.g., storage device 140, terminal 130) or from a third-party system (e.g., a database system of the supplier of the target attribute determination model). For example, the at least one target attribute determination model may be trained in advance by a computing device (e.g., processing device 112) and stored in a memory of system 100 (e.g., storage device 140, storage 220, and/or storage 390). Processing device 112 may access the memory and retrieve the at least one target attribute determination model. In some embodiments, the at least one target attribute determination model may be generated according to one or more machine learning algorithms, which may include, but are not limited to, artificial neural network algorithms, deep learning algorithms, decision tree algorithms, association rule algorithms, inductive logic programming algorithms, support vector machine algorithms, clustering algorithms, Bayesian network algorithms, reinforcement learning algorithms, representation learning algorithms, similarity metric learning algorithms, sparse dictionary learning algorithms, genetic algorithms, rule-based machine learning algorithms, and the like, or any combination thereof.

Merely by way of example, processing device 112 or another computing device (e.g., an external computing device for training the target attribute determination model) may train the target attribute determination model according to a supervised learning algorithm. Processing device 112 may obtain one or more first training samples and a first initial model. Each first training sample may include sample dialog feature information of a sample user, sample user feature information, and sample attribute information (or a sample attribute information sequence) of a sample audio and video played for the user. The first initial model to be trained may include one or more model parameters, e.g., the number of layers, the number of nodes, a first loss function, and the like, or any combination thereof. Before training, the first initial model may have initial values of the one or more model parameters.

Training the first initial model may include one or more first iterations to iteratively update the model parameters of the first initial model based on the one or more first training samples, until a first termination condition is satisfied in a certain iteration. Exemplary first termination conditions may be that the value of the first loss function obtained in an iteration is less than a threshold, that a certain number of iterations has been performed, or that the first loss function converges such that the difference between the values of the first loss function obtained in the previous iteration and the current iteration falls within a threshold range. The first loss function may measure the difference, in an iteration, between the audio and video attribute information predicted by the first initial model and the sample audio and video attribute information, or between the predicted attribute information sequence and the sample attribute information sequence. For example, the sample dialog feature information and sample user feature information of each first training sample may be input into the first initial model, and the first initial model may output predicted audio and video attribute information or a predicted attribute information sequence for the first training sample; the first loss function may measure the difference between the prediction and the sample label for each first training sample. Exemplary first loss functions may include a focal loss function, a log loss function, a cross-entropy loss, and the like. If the first termination condition is not satisfied in the current iteration, processing device 112 may further update the first initial model for the next iteration according to a machine learning algorithm (e.g., a back-propagation algorithm). If the first termination condition is satisfied in the current iteration, processing device 112 may designate the first initial model in the current iteration as the target attribute determination model.
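The training loop just described is an ordinary supervised loop. A minimal PyTorch sketch follows; the feature dimension, network shape, learning rate, iteration cap, and loss threshold are all assumptions for illustration, and random tensors stand in for real training samples.

```python
import torch
from torch import nn

# Assumed shapes: 16-dim concatenated dialog+user features,
# 4 discrete values for one target attribute (e.g., level of detail).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()            # stand-in for the "first loss function"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.randn(64, 16)             # stand-in sample features
labels = torch.randint(0, 4, (64,))        # stand-in sample attribute labels

for iteration in range(200):               # capped number of iterations
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()                        # back-propagation update
    optimizer.step()
    if loss.item() < 0.05:                 # assumed first termination condition
        break
```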
In 620, processing device 112 (e.g., the determination module 420) may input at least a portion of the dialog feature information and the user feature information into the at least one trained target attribute determination model.

In some embodiments, processing device 112 may input the dialog feature information determined in operation 520 and the user feature information determined in operation 530 into a single target attribute determination model, which may output all of the audio and video attribute information or attribute information sequences. In some embodiments, processing device 112 may input the dialog feature information determined in operation 520 and the user feature information determined in operation 530 into multiple target attribute determination models, each of which may output one or more corresponding items of audio and video attribute information, or one or more attribute information sequences. For example, the processing device may input the dialog feature information and the user feature information into a first target attribute determination model and a second target attribute determination model, respectively; the first model may output one or more corresponding items of attribute information or attribute information sequences, e.g., the semantic information, the level of detail, and the difficulty of understanding, while the second model may output the remaining attribute information or attribute information sequences.

In some embodiments, the processing device may preprocess at least a portion of the dialog feature information and the user feature information to generate a corresponding model input feature sequence, and input the model input feature sequence into the at least one target attribute determination model to obtain the audio and video attribute information or attribute information sequence. For example, the preprocessing operations of processing device 112 may include removing special characters irrelevant to judging the semantics of a sentence, and normalizing some non-key information in the dialog and mapping it to uniform tokens.
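The preprocessing just mentioned (stripping semantically irrelevant characters and normalizing non-key information into uniform tokens) can be sketched with regular expressions. The specific patterns and placeholder tokens below are assumptions, not part of the disclosure.

```python
import re

def preprocess(text):
    """Normalize a dialog message before it is fed to the model."""
    text = re.sub(r"https?://\S+", "<URL>", text)  # map links to one token
    text = re.sub(r"\d+", "<NUM>", text)           # map numbers to one token
    text = re.sub(r"[^\w\s<>]", " ", text)         # drop special characters
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("See https://example.com!!! Call 12345 now..."))
# -> "See <URL> Call <NUM> now"
```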
In 630, processing device 112 (e.g., the determination module 420) may determine the at least one target attribute based on the output of the at least one trained target attribute determination model. As described in operation 620, the output of the at least one target attribute determination model may be one or more items of audio and video attribute information or attribute information sequences. In some embodiments, processing device 112 may obtain the output of the at least one target attribute determination model and determine the at least one target attribute based on the obtained model output. For example, processing device 112 may rank the obtained items of audio and video attribute information, e.g., by importance. According to actual needs, processing device 112 may further select one or more items according to the ranking (e.g., the top-ranked one or more items of attribute information) as the at least one target attribute.

In some embodiments, processing device 112 may send at least a portion of the dialog feature information determined in operation 520 and the user feature information determined in operation 530 to another computing device (e.g., a computing device of the supplier of the target attribute determination model). That computing device may generate one or more items of audio and video attribute information based on the received dialog feature information and user feature information, and send the generated items to processing device 112. Processing device 112 may determine the at least one target attribute based on the received one or more items of audio and video attribute information.

It should be noted that the description of process 600 is for illustration purposes and is not intended to limit the scope of protection of the present application. Those skilled in the art may make multiple variations and modifications under the guidance of the present application; such variations and modifications do not depart from the scope of protection of the present application. For example, process 600 may further include a step of obtaining the model output, or one or more storage steps (e.g., storing the model's inputs and outputs).
FIG. 7 is a flowchart of determining the target audio and video based on a database according to some embodiments of the present application. In some embodiments, process 700 may be implemented by a set of instructions (e.g., an application) stored in a storage device (e.g., storage device 140, the ROM 230 or RAM 240 of computing device 200, or the storage 390 or memory 360 of mobile device 300). For example, processor 220 and/or the modules in FIG. 4 may execute the set of instructions and, when executing the instructions, may be configured to perform process 700. The operations of the process shown below are for illustration purposes only. In some embodiments, process 700 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed herein. Additionally, the order of the operations shown in FIG. 7 and described below is not limiting.

In 710, processing device 112 (e.g., the obtaining module 410) may obtain a database. In some embodiments, the database may be pre-built and stored by system 100. The database may also be formed from external resources, e.g., obtained from an external storage device via network 120. The database may include a visual database, an audio database, a text database, a picture database, and the like, or a combination thereof.

The database may be used to provide candidate content for the target audio and video. The database may include a plurality of materials, e.g., texts, images, audios, videos. In some embodiments, the plurality of materials may include one or more candidate audios, one or more candidate videos, one or more candidate texts, one or more candidate images, and the like, or a combination thereof. In some embodiments, the candidate content among the plurality of materials may be determined by processing device 112 from the dialog feature information related to the user. For example, processing device 112 may determine the candidate content from keywords in the dialog information. Merely by way of example, if the keywords are "system, how to use", the plurality of materials may include multiple candidates related to using the system.

In 720, processing device 112 (e.g., the determination module 420) may determine the target audio and video based on the at least one target attribute and the database.

In some embodiments, processing device 112 may match each of the at least one target attribute against the materials in the database, select target materials based on the matching results, and thereby determine the target audio and video. For example, a selected target material may itself be an audio and video and may be directly designated as the target audio and video. As another example, the processing device may further adjust at least some basic attributes of a target material in audio and video form according to the target attributes to generate the target video, e.g., adjusting the playback speed, playback voice, or picture tone. In some embodiments, processing device 112 may generate a new video as the target audio and video based on the target materials. For example, processing device 112 may generate the target audio and video based on one or more passages of target text among the target materials. Specifically, processing device 112 may obtain the text sequence of the one or more passages of target text, and then generate a corresponding speech sequence as the target audio based on the text sequence. As another example, processing device 112 may further generate the target video based on the target audio and one or more target pictures among the target materials. Processing device 112 may also determine the target audio and video based on various combinations of the target materials, e.g., generating the target video based on multiple target pictures found in the database and one target audio.
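Where the selected target material is text, the step above has the processing device synthesize speech from the text sequence. A minimal offline text-to-speech sketch using the pyttsx3 library is shown below; the library choice, voice rate, and output file name are assumptions rather than part of the disclosure.

```python
import pyttsx3

def synthesize(text, out_path="target_audio.wav", rate=150):
    """Render a text material into an audio file (offline TTS)."""
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)    # speaking rate, words per minute
    engine.save_to_file(text, out_path)
    engine.runAndWait()                 # blocks until the file is written
    return out_path

synthesize("Here is how to use the system, step by step.")
```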
In some embodiments, to determine the target materials, for each of the at least one target attribute, processing device 112 may compute a match degree between the basic attributes of each of the plurality of materials in the database and that target attribute. The match degree may be expressed as a number (e.g., 1-10) or a grade (e.g., high, medium, low). Taking a passage of candidate text as an example of a material, processing device 112 needs to compute the match degree between the basic attributes of the candidate text and each target attribute. In some embodiments, the basic attributes and the target attributes may be represented by numeric values. To compute the match degree between a basic attribute of a material and the corresponding target attribute, the processor may compare the difference between the value of the material's basic attribute and the value of the corresponding target attribute (e.g., by determining their ratio). Similarly, processing device 112 may compute the match degree between the semantic information among the candidate text's basic attributes and the semantic information among the target attributes, between the level of detail among the candidate text's basic attributes and the level of detail among the target attributes, and so on.

Based on the computed match degrees between each material's basic attributes and the target attributes, processing device 112 may determine one or more match scores. In some embodiments, the match score of a basic attribute against a target attribute may be positively (or negatively) correlated with the match degree. The match score may be in numeric form (e.g., a percentage), such as 30% or 60%. In some embodiments, processing device 112 may select one or more target materials from the plurality of materials based on each material's one or more match scores. For example, processing device 112 may sum each material's match scores for the corresponding target attributes to obtain a total match score and rank the materials by the total. As another example, processing device 112 may take the average of each material's match scores for the corresponding target attributes and rank the materials by the average. As yet another example, different weights may be assigned to the target attributes; each match score is multiplied by the weight of the corresponding target attribute and the products are summed to obtain the total match score, by which the materials are ranked. Processor 112 may further select the corresponding target materials according to the ranking (e.g., the top 20%, or the top three materials).
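The weighted ranking just described can be sketched as follows. The similarity measure (closeness of numeric attribute values on a 0-1 scale), the weights, and the stand-in materials are illustrative assumptions.

```python
def attribute_match(material_value, target_value):
    """Match degree in [0, 1]; 1 means identical numeric values."""
    return 1.0 - min(abs(material_value - target_value), 1.0)

def total_score(material, targets, weights):
    """Weighted sum of per-attribute match degrees."""
    return sum(weights[attr] * attribute_match(material[attr], value)
               for attr, value in targets.items())

targets = {"detail": 0.8, "difficulty": 0.3}
weights = {"detail": 0.6, "difficulty": 0.4}
materials = [
    {"id": "video_a", "detail": 0.9, "difficulty": 0.4},
    {"id": "text_b",  "detail": 0.4, "difficulty": 0.3},
]
ranked = sorted(materials,
                key=lambda m: total_score(m, targets, weights),
                reverse=True)
print([m["id"] for m in ranked])   # best-matching materials first
```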
In some embodiments, processing device 112 may determine the target audio and video by adjusting basic attributes of the one or more target materials based on the one or more target materials and the at least one target attribute. For example, the one or more target materials may include one or more initial audios and videos. Processing device 112 may adjust at least some basic attributes of the initial audio and video according to the at least one target attribute, e.g., the playback speed, picture tone, or voice timbre of the initial audio and video.

It should be noted that the above description of process 700 is provided for illustration purposes only and is not intended to limit the scope of the present application. Those of ordinary skill in the art may make various changes and modifications under the teaching of the present application without departing from its scope. In some embodiments, process 700 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed herein. Additionally or alternatively, the order of the operations of process 700 shown in FIG. 7 is not limiting.
FIG. 8 is another flowchart of determining the target audio and video according to some embodiments of the present application. In some embodiments, process 800 may be implemented by a set of instructions (e.g., an application) stored in a storage device (e.g., storage device 140, the ROM 230 or RAM 240 of computing device 200, or the storage 390 or memory 360 of mobile device 300). For example, processor 220 and/or the modules in FIG. 4 may execute the set of instructions and, when executing the instructions, may be configured to perform process 800. The operations of the process shown below are for illustration purposes only. In some embodiments, process 800 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed herein. Additionally, the order of the operations shown in FIG. 8 and described below is not limiting.

In 810, processing device 112 (e.g., the obtaining module 410) may obtain a trained material determination model. The material determination model may be a model, e.g., a machine learning model, for generating target materials related to the target audio and video. In some embodiments, the material determination model may include a deep learning model, e.g., a deep neural network (DNN) model, a convolutional neural network (CNN) model, a recurrent neural network (RNN) model, a feature pyramid network (FPN) model, a Seq2Seq model, a long short-term memory (LSTM) model, and the like. Merely by way of example, the material determination model may receive a model input (e.g., dialog feature information and/or other information related to the user), and may output one or more target materials related to the target audio and video. The one or more target materials related to the target audio and video may include one or more audios, one or more videos, one or more texts, one or more images, and the like. For details on the target materials, reference may be made to FIG. 7 and its description, which are not repeated here.

In some embodiments, processing device 112 may obtain the trained material determination model from one or more components of system 100 (e.g., storage device 140, terminal 130) or from a third-party system (e.g., a database system of the supplier of the material determination model). For example, the material determination model may be trained in advance by a computing device (e.g., processing device 112) and stored in a memory of system 100 (e.g., storage device 140, storage 220, and/or storage 390). Processing device 112 may access the memory and retrieve the material determination model. In some embodiments, the material determination model may be generated according to one or more machine learning algorithms described elsewhere in the present application (e.g., operation 610 in FIG. 6 and its description).

Merely by way of example, processing device 112 (e.g., the training module 430) or another computing device (e.g., a computing device of the supplier of the material determination model) may train the material determination model according to a supervised learning algorithm. Processing device 112 may obtain one or more second training samples and a second initial model. Each second training sample may include sample dialog feature information of a sample user and one or more sample audio and video materials. The second initial model to be trained may include one or more model parameters, e.g., the number of layers, the number of nodes, a second loss function, and the like, or any combination thereof. Before training, the second initial model may have initial values of the one or more model parameters.

Training the second initial model may include one or more second iterations to iteratively update the model parameters of the second initial model based on the one or more second training samples, until a second termination condition is satisfied in a certain iteration. Exemplary second termination conditions may be that the value of the second loss function obtained in an iteration is less than a threshold, that a certain number of iterations has been performed, or that the second loss function converges such that the difference between the values of the second loss function obtained in the previous iteration and the current iteration falls within a threshold range. The second loss function may measure the difference, in an iteration, between the one or more audio and video materials predicted by the second initial model and the corresponding sample audio and video materials. For example, the sample dialog feature information of each second training sample may be input into the second initial model, and the second initial model may output one or more predicted audio and video materials for the training sample; the second loss function may measure the difference between each training sample's one or more predicted audio and video materials and the corresponding sample materials. Exemplary second loss functions may include a focal loss function, a log loss function, a cross-entropy loss, and the like. If the second termination condition is not satisfied in the current iteration, processing device 112 may further update the second initial model for the next iteration according to a machine learning algorithm (e.g., a back-propagation algorithm). If the second termination condition is satisfied in the current iteration, processing device 112 may designate the second initial model in the current iteration as the material determination model.
In 820, processing device 112 (e.g., the determination module 420) may input the dialog feature information into the material determination model.

In some embodiments, processing device 112 may input the dialog feature information determined in operation 520 directly into the material determination model, which may output one or more audio and video materials. In some embodiments, processing device 112 may encode the dialog feature information determined in operation 520 into a dialog feature information sequence and input that sequence into the material determination model, which may output a corresponding sequence of audio and video materials.

In some embodiments, the processing device may preprocess at least a portion of the dialog feature information and input the preprocessed portion into the material determination model to obtain one or more audio and video materials or an audio and video material sequence. For example, processing device 112 may perform one or more preprocessing operations, e.g., preprocessing the dialog feature information to generate a corresponding model input sequence, removing special characters irrelevant to judging sentence semantics, and normalizing links, place names, and similar information in the dialog into uniform tokens.
In 830, processing device 112 (e.g., the determination module 420) may determine an initial audio and video based on the output of the material determination model.

As described in operation 820, the output of the material determination model may be one or more audio and video materials or an audio and video material sequence. In some embodiments, processing device 112 may obtain the output of the material determination model and determine the initial audio and video based on the obtained model output. For example, if the material model directly outputs a complete audio and video, processing device 112 designates it as the initial audio and video. As another example, if the material model outputs two or more audio and video segments, or a video sequence composed of multiple segments, processing device 112 may splice the segments together in a certain order to generate the initial audio and video. As yet another example, if the material model outputs one or more pictures and one or more audios, processing device 112 may combine the one or more pictures with the corresponding one or more audios to generate the initial audio and video. Likewise, if the material model outputs one or more pictures and one or more texts, processing device 112 may combine them to generate the initial audio and video. Merely by way of example, the output of the material determination model may be a passage of text; processor 112 may convert the text into audio and combine it with a picture containing a virtual character to generate a video, simulating a video conversation between the virtual character and the user. This helps increase the user's interest in the dialog and provides a good user experience. For example, when the user is a child, processor 112 may generate a video containing a cartoon character to simulate a video conversation between the cartoon character and the user.
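One way to realize the assembly step just described — splicing clips, or pairing a still image with a narration track — is with the moviepy library (1.x API assumed). The sketch below shows both paths; the file names, durations, and frame rate are hypothetical.

```python
from moviepy.editor import (AudioFileClip, ImageClip, VideoFileClip,
                            concatenate_videoclips)

def splice_segments(paths, out_path="initial_av.mp4"):
    """Concatenate model-selected clips, in order, into one video."""
    clips = [VideoFileClip(p) for p in paths]
    concatenate_videoclips(clips).write_videofile(out_path, fps=24)

def image_plus_audio(image_path, audio_path, out_path="initial_av.mp4"):
    """Pair a still image (e.g., a virtual character) with narration."""
    audio = AudioFileClip(audio_path)
    clip = ImageClip(image_path).set_duration(audio.duration).set_audio(audio)
    clip.write_videofile(out_path, fps=24)
```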
In some embodiments, processing device 112 may send at least a portion of the dialog feature information determined in operation 520 to another computing device (e.g., a computing device of the supplier of the material determination model). That computing device may generate one or more audio and video materials based on the received dialog feature information and send the generated materials to processing device 112. Processing device 112 may determine the initial audio and video based on the received one or more audio and video materials.
In 840, processing device 112 (e.g., the determination module 420) may generate the target audio and video by adjusting basic attributes of the initial audio and video based on the at least one target attribute. For the determination of the at least one target attribute, reference may be made to the descriptions elsewhere in the present application (e.g., FIGS. 5 and 6 and their descriptions), which are not repeated here.

In some embodiments, processing device 112 may, with reference to the at least one target attribute, determine the basic attributes of the initial audio and video that need to be adjusted. Processing device 112 may further determine the adjustable range of each basic attribute that needs to be adjusted. Processing device 112 may then adjust each such basic attribute based on its adjustable range and the corresponding target attribute, so that the adjusted basic attribute is identical or close to the corresponding target attribute. The adjustable range of a basic attribute means that the attribute can be adjusted within a certain range. For example, suppose the at least one target attribute includes a detailed level of detail, an easy difficulty of understanding, a slow playback speed, a normal male playback voice, and a warm picture tone, while the basic attributes of the initial audio and video include a detailed level of detail, an easy difficulty of understanding, a fast playback speed, a child's playback voice, and a warm picture tone. Processing device 112 may determine that the basic attributes to be adjusted include the playback speed and the playback voice, and adjust them accordingly. Processing device 112 may designate the adjusted initial audio and video as the target audio and video. If the basic attributes of the initial audio and video match each of the at least one target attribute, processing device 112 may directly designate the initial audio and video as the target audio and video.
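Adjusting a basic attribute such as the playback speed, as just described, can be done with ffmpeg's setpts/atempo filters. The 1.5x factor echoes the example earlier in the text; invoking the ffmpeg command-line tool through subprocess is an implementation assumption.

```python
import subprocess

def change_speed(src, dst, factor=1.5):
    """Re-render `src` at `factor`x speed (video PTS and audio tempo)."""
    subprocess.run([
        "ffmpeg", "-i", src,
        "-filter_complex",
        f"[0:v]setpts={1 / factor}*PTS[v];[0:a]atempo={factor}[a]",
        "-map", "[v]", "-map", "[a]", dst,
    ], check=True)

change_speed("target.mp4", "target_1_5x.mp4")  # hypothetical file names
```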
It should be noted that the description of process 800 is for illustration purposes and is not intended to limit the scope of protection of the present application. Those skilled in the art may make multiple variations and modifications under the guidance of the present application; such variations and modifications do not depart from the scope of protection of the present application. For example, process 800 may further include a sending step that sends the target audio and video to a target terminal, or one or more storage steps (e.g., storing the initial audio and video and the target audio and video).
FIG. 9 is a schematic diagram of interaction between a terminal and a server according to some embodiments of the present application. In some embodiments, the interaction process 900 shown in FIG. 9 and its exemplary steps may be implemented by a set of instructions (e.g., an application) stored in a storage device (e.g., storage device 140, the ROM 230 or RAM 240 of computing device 200, or the storage 390 or memory 360 of mobile device 300). For example, processor 220 and/or the modules in FIG. 4 may execute the set of instructions and, when executing the instructions, may be configured to perform interaction process 900. The operations of the process shown below are for illustration purposes only. In some embodiments, interaction process 900 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed herein. Interaction process 900 serves only as an example illustrating the overall application flow of the human-machine interaction of the present application and is not a limitation of the present application.

Interaction process 900 may be applied to various intelligent human-machine dialog scenarios, including but not limited to dialog between a user and an intelligent customer service robot, a smart speaker, a chatbot, a smart home device (e.g., a smart TV, smart air conditioner, smart sweeping/mopping device), a smart vehicle, or a web page or app on a terminal. According to interaction process 900, the user may interact with server 110 (or processing device 112 in server 110) through terminal 130 (e.g., a user interface on terminal 130). Merely by way of example, the user may ask a question through the user interface on terminal 130, processing device 112 may answer the user's question by generating target audio and video, and the user may also provide user feedback on the target audio and video so that its content is optimized to better serve the user. Specifically, steps 901, 905, 908, 909, and 9011 are performed by terminal 130, and steps 902, 903, 904, 906, 907, 9010, 9012, and 9013 are performed by server 110.
In 901, terminal 130 may receive the dialog information related to the user (or simply the dialog information) input by the user. "Dialog information related to the user" refers to dialog information sent by the user through the terminal and/or dialog information received by the user through the terminal, including but not limited to voice, text, and pictures. After receiving the dialog information input by the user, terminal 130 may transmit it to server 110 (e.g., processing device 112) via network 120, so that the dialog information is obtained. Taking an intelligent customer service robot as terminal 130 as an example, the user may input the dialog information through the robot's user interface to converse with it.

In 902, processing device 112 may obtain, from terminal 130, the dialog information most recently sent by the user and/or the contextual dialog information. The forms the most recently sent dialog information may take (characters, sentences, voice messages, pictures; declarative or interrogative sentences), the notion of contextual rounds of dialog, and the presetting of the number of rounds in system 100 are as described in connection with operation 510 and are not repeated here.

Processing device 112 may determine the dialog feature information (e.g., keywords, emotions) of the dialog information in the manner described in connection with operation 520: extracting keywords from text (e.g., using topic models, TF-IDF, TextRank, or RAKE), converting speech into text by speech recognition before extraction, identifying textual and emotional features in pictures by image recognition, and performing semantic recognition over the contextual dialog information through a hierarchical or non-hierarchical model. These details are not repeated here.

In some embodiments, the user may also input user feature information through the terminal. The user feature information may include user personal information, user preference information, and other information about the user (e.g., the user's hobbies), as described in connection with operation 530. For example, the terminal may provide a form for the user to fill in personal information, and processing device 112 may obtain the user personal information from the terminal via network 120; alternatively, processing device 112 may obtain the user's user ID through the app installed on the terminal and retrieve, via network 120, the corresponding personal information and preference settings from storage device 140. The user's current emotion may be identified from the dialog information, e.g., by comparing extracted emotion keywords against an emotion keyword list, as described above. For the historical user feedback on historical audio and video, terminal 130 may store the operations the user performed on historical audio and video (e.g., pausing, replaying, fast-forwarding, and adjusting the playback speed or picture quality), and processing device 112 may access terminal 130 to obtain these operations and thereby determine the user's feedback on the playback content, playback speed, and picture quality.
In 903, processing device 112 may determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video. The target attributes (e.g., semantic information, level of detail, difficulty of understanding, playback speed, timbre of the playback voice, playback picture quality, picture tone), the attribute determination rules — comparing a single user feature with preset categories or thresholds (e.g., the first and second age thresholds of 60 and 10), and combining multiple weighted user features (e.g., the 0.3*0.5 + 0.4*0.3 + 0.3*0.3 = 0.36 example) — and the model-based determination using at least one trained target attribute determination model are as described in connection with operation 540 and FIG. 6, and are not repeated here.
In 904, processing device 112 may judge, based on the at least one target attribute, whether the target audio and video can be determined.

In 906, if the target audio and video can be determined, processing device 112 may automatically determine it based on the at least one target attribute, thereby providing personalized target audio and video content, a better viewing experience, better help in answering the user's questions, and/or an improved human-machine interaction experience, as described in connection with operation 550.

In some embodiments, processing device 112 may obtain a database from storage device 140, each of whose materials has at least one basic attribute, and select from it a target audio and video matching the target attributes — for example, a detailed video on how to use the system in response to "How do I use this system?" when the user prefers detailed playback content. Processor 112 may also generate the target audio and video from one or more matching non-audio/video materials, e.g., generating speech from text and synthesizing the text, images, and speech into the target video. For determining the target audio and video from the database, reference may be made to FIG. 7 and its description, which are not repeated here.

In some embodiments, processing device 112 may determine the target audio and video based on the at least one target attribute using a machine learning model; reference may be made to FIG. 8 and its description, which are not repeated here.

In some embodiments, the target audio and video may be a single video or may include a plurality of segments arranged in order and played in order by the terminal, as described in connection with operation 550 — e.g., multiple matching audios and videos ordered by match degree, or by the value of one target attribute such as increasing difficulty of understanding. In some embodiments, processing device 112 may directly generate a target audio and video containing multiple segments using a machine learning model (e.g., the material determination model described in FIG. 8).

In 907, if the target audio and video cannot be determined, processing device 112 may automatically generate materials in non-audio/video form. In some embodiments, when processing device 112 determines that the database contains no target audio and video matching the target attributes, it may obtain from the database one or more materials in non-audio/video form (e.g., text, pictures) that match the target attributes. In some embodiments, if the user's question is relatively simple, processor 112 may also directly decide that no target audio and video needs to be determined, and proceed to 907. Optionally, processor 112 may send the matching text or pictures directly to the terminal to converse with the user.
In 905, terminal 130 may receive the automatically generated target audio and video, or the non-audio/video materials, via network 120. In 908, for a received target audio and video, the user may play the target audio and video through terminal 130 (e.g., by clicking). In 909, terminal 130 may also be set to play the target audio and video automatically. While the target audio and video is playing, terminal 130 may receive user feedback provided by the user, which may include a pause count, pause duration, replay count, replay duration, fast-forward count, fast-forward duration, slow-play count, slow-play duration, whether a new question is raised, whether playback is ended early, and the like, or a combination thereof. Such feedback can indicate the user's ability to understand and absorb the target audio and video and/or the user's degree of acceptance of it (e.g., like or dislike).

In 9010, processing device 112 may obtain the user feedback through the operations the user performs on the target audio and video at terminal 130, as described in connection with operation 560: terminal 130 may detect operations on buttons of the user interface and/or gesture operations (pause, replay, fast-forward, slow-play, close, sending a new message) and transmit them to processing device 112 via network 120. For example, if the user fast-forwards the target audio and video more than a certain threshold number of times (e.g., 3), this may indicate that the content is too simple and easy to understand for the user, that the user dislikes the content, or that the user prefers a faster playback speed.

In 9012, processing device 112 may automatically adjust the unplayed segments of the target audio and video. For example, in response to the user providing user feedback during playback, processing device 112 may, based on the user feedback, automatically adjust the basic attributes of at least one unplayed segment among the one or more segments of the target audio and video, thereby further optimizing the content and providing a better viewing effect and user experience.

The basic attributes of the unplayed segments (which may correspond to some or all of the target attributes) and the attribute adjustment rules are as described in connection with operation 570. For example, an adjustment rule may test whether a single type of feedback exceeds a threshold: if the pause count is greater than a threshold (e.g., 3), at least one unplayed segment may be replaced by a more detailed segment; if the fast-forward duration is greater than a threshold (e.g., 3 minutes), the playback speed of at least one unplayed segment may be adjusted, e.g., to 1.5x.
In 9011, terminal 130 may receive the adjusted target audio and video via network 120. In some embodiments, the adjusted target audio and video may overwrite the original target audio and video. In some embodiments, the adjusted target audio and video may be a newly generated target audio and video. Additionally or alternatively, terminal 130 may present a prompt about the adjusted target audio and video, e.g., displaying the adjusted basic attributes (such as "The playback speed has been adjusted to 1.5x").

In 909, in some embodiments, the user may also provide user feedback on the adjusted target audio and video through terminal 130; terminal 130 may continue to receive user feedback, so that the steps in 9010-9011 are performed again until the user no longer provides user feedback.

In 9013, processor 112 may determine whether the user needs to send other dialog information. For example, terminal 130 may display a dialog box on its interface asking whether the user needs to send other dialog information, e.g., "Do you have any other questions?" If so, the content of step 901 is executed again, and the user may input further dialog information. If not, the interaction process ends, and the interface may display a closing remark, e.g., "This conversation is over, thank you".

It should be noted that the description of interaction process 900 is for illustration purposes and is not intended to limit the scope of protection of the present application. Those skilled in the art may make multiple variations and modifications under the guidance of the present application; such variations and modifications do not depart from the scope of protection of the present application.
It should be noted that different embodiments may produce different beneficial effects; in different embodiments, the possible beneficial effects may be any one or a combination of the above, or any other beneficial effect that may be obtained.

The above describes the present application and/or some other examples. The present application may be modified in various ways in light of the above. The subject matter disclosed herein can be implemented in different forms and examples, and the present application can be applied to a large number of applications. All applications, modifications, and changes claimed in the following claims fall within the scope of the present application.

The present application uses specific words to describe its embodiments. Terms such as "one embodiment", "an embodiment", and/or "some embodiments" mean a certain feature, structure, or characteristic related to at least one embodiment of the present application. It should therefore be emphasized and noted that "an embodiment", "one embodiment", or "an alternative embodiment" mentioned two or more times in different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics of one or more embodiments of the present application may be appropriately combined.

Those skilled in the art will appreciate that various variations and improvements may be made to the content disclosed herein. For example, the different system components described above are implemented by hardware devices, but may also be implemented by software-only solutions, e.g., installing the system on an existing server. In addition, the provision of the location information disclosed herein may be implemented by firmware, a firmware/software combination, a firmware/hardware combination, or a hardware/firmware/software combination.

All or part of the software may sometimes communicate over a network, such as the Internet or another communication network. Such communications enable software to be loaded from one computer device or processor to another, for example, from a management server or host computer of the system onto the hardware platform of a computing environment, or onto another computing environment implementing the system, or a system with similar functions related to providing the information needed to determine the target audio and video. Accordingly, another medium capable of carrying software elements — such as light waves, radio waves, or electromagnetic waves propagating through cables, optical cables, or the air — may also be used as a physical connection between local devices. A physical medium used for a carrier wave, such as a cable, a wireless link, or an optical cable, may likewise be regarded as a medium carrying the software. Unless usage here is limited to a tangible "storage" medium, other terms referring to a computer- or machine-"readable medium" denote media that participate in a processor's execution of any instruction.

The computer program code required for the operation of the various parts of the present application may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP; dynamic programming languages such as Python, Ruby, and Groovy; or other programming languages. The program code may run entirely on the user's computer, as a stand-alone software package on the user's computer, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any form of network, e.g., a local area network (LAN) or a wide area network (WAN), or connected to an external computer (e.g., through the Internet), or in a cloud computing environment, or used as a service, such as Software as a Service (SaaS).

In addition, unless explicitly stated in the claims, the order of processing elements and sequences, the use of alphanumeric characters, or the use of other designations in the present application is not intended to limit the order of the processes and methods herein. Although the above disclosure discusses, by way of various examples, some embodiments of the invention currently considered useful, it should be understood that such details serve illustrative purposes only, and that the appended claims are not limited to the disclosed embodiments; rather, the claims are intended to cover all amendments and equivalent combinations that conform to the substance and scope of the embodiments of the present application. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that, to simplify the presentation of the present disclosure and thereby aid the understanding of one or more embodiments of the invention, the foregoing description of the embodiments sometimes groups multiple features into one embodiment, drawing, or description thereof. This method of disclosure, however, does not imply that the claimed subject matter requires more features than are recited in the claims. Indeed, the features of an embodiment may be fewer than all the features of a single embodiment disclosed above.

Some embodiments use numbers describing quantities of components and attributes. It should be understood that such numbers used in describing embodiments are, in some examples, modified by the words "about", "approximately", or "substantially". Unless otherwise stated, "about", "approximately", or "substantially" indicates that the stated number allows a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may change depending on the desired characteristics of individual embodiments. In some embodiments, numerical parameters should take into account the specified significant digits and adopt a general method of retaining digits. Although the numerical ranges and parameters used to confirm the breadth of ranges in some embodiments of the present application are approximations, in specific embodiments such values are set as precisely as feasible.

Each patent, patent application, patent application publication, and other material cited in the present application, such as articles, books, specifications, publications, and documents, is hereby incorporated by reference in its entirety — except for application history documents that are inconsistent with or conflict with the content of the present application, and documents (currently or later appended to the present application) that limit the broadest scope of the claims. It should be noted that where the descriptions, definitions, and/or use of terms in the materials accompanying the present application are inconsistent with or conflict with the content described herein, the descriptions, definitions, and/or use of terms of the present application shall prevail.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present application. Other variations may also fall within the scope of the present application. Accordingly, by way of example and not limitation, alternative configurations of the embodiments of the present application may be regarded as consistent with the teachings of the present application. The embodiments of the present application are therefore not limited to those explicitly presented and described herein.

Claims (26)

  1. A method for determining target audio and video, implemented on a computing device having at least one processor and at least one storage device, the method comprising:
    obtaining dialog information related to a user;
    determining dialog feature information of the dialog information;
    obtaining user feature information of the user;
    determining, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and
    determining the target audio and video based on the at least one target attribute.
  2. The method of claim 1, wherein determining the target audio and video based on the at least one target attribute comprises:
    obtaining a database including a plurality of materials, the plurality of materials including at least one of: one or more candidate audios, one or more candidate videos, one or more candidate texts, or one or more candidate images; and
    determining the target audio and video based on the at least one target attribute and the database.
  3. The method of claim 2, wherein determining the target audio and video based on the at least one target attribute further comprises:
    for each of the at least one target attribute,
    computing a match degree between each of the plurality of materials in the database and the target attribute;
    for each of the plurality of materials in the database,
    determining a total match score based on the material's at least one match degree corresponding to the at least one target attribute;
    selecting, from the plurality of materials and based on the match scores corresponding to the plurality of materials in the database, one or more target materials; and
    determining the target audio and video based on the one or more target materials.
  4. The method of claim 3, wherein determining the target audio and video based on the at least one target attribute further comprises:
    generating the target audio and video by adjusting basic attributes of the one or more target materials based on the one or more target materials and the at least one target attribute.
  5. The method of claim 1, wherein the at least one target attribute of the target audio and video includes one or more of a content attribute, a level of detail, a difficulty of understanding, a playback speed, a picture tone, or a timbre.
  6. The method of claim 1, wherein determining, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video comprises:
    obtaining at least one trained target attribute determination model;
    inputting at least a portion of the dialog feature information and the user feature information into the at least one trained target attribute determination model; and
    determining the at least one target attribute based on an output of the at least one trained target attribute determination model.
  7. The method of claim 1, wherein determining the target audio and video based on the at least one target attribute comprises:
    obtaining a trained material determination model;
    inputting the dialog feature information into the material determination model;
    determining an initial audio and video based on an output of the material determination model; and
    generating the target audio and video by adjusting basic attributes of the initial audio and video based on the at least one target attribute.
  8. The method of claim 1, wherein the target audio and video includes one or more segments.
  9. The method of claim 8, further comprising:
    determining whether the user provides user feedback while the target audio and video is being played; and
    in response to the user providing user feedback while the target audio and video is being played,
    determining, based on the user feedback, whether basic attributes of at least one unplayed segment among the one or more segments of the target audio and video need to be adjusted.
  10. The method of claim 9, wherein the user feedback provided by the user while the target audio and video is being played includes one or more of: a pause count, a pause duration, a replay count, a replay duration, a fast-forward count, a fast-forward duration, a slow-play count, a slow-play duration, whether a new question is raised, and whether playback is ended early.
  11. The method of claim 1, wherein the user feature information includes user personal information, the user personal information including one or more of: age, gender, educational background, work background, and health status.
  12. The method of claim 1, wherein the user feature information includes user preference information, the user preference information including at least one of the user's preference settings, the user's current emotion, or historical user feedback provided by the user on historical audio and video.
  13. A system for determining target audio and video, comprising:
    at least one memory storing computer instructions;
    at least one processor in communication with the memory, wherein when the at least one processor executes the computer instructions, the at least one processor causes the system to:
    obtain dialog information related to a user;
    determine dialog feature information of the dialog information;
    obtain user feature information of the user;
    determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and
    determine the target audio and video based on the at least one target attribute.
  14. The system of claim 13, wherein to determine the target audio and video based on the at least one target attribute, the at least one processor causes the system to further:
    obtain a database including a plurality of materials, the plurality of materials including at least one of: one or more candidate audios, one or more candidate videos, one or more candidate texts, or one or more candidate images; and
    determine the target audio and video based on the at least one target attribute and the database.
  15. The system of claim 14, wherein to determine the target audio and video based on the at least one target attribute, the at least one processor causes the system to further:
    for each of the at least one target attribute,
    compute a match degree between each of the plurality of materials in the database and the target attribute;
    for each of the plurality of materials in the database,
    determine a total match score based on the material's at least one match degree corresponding to the at least one target attribute;
    select, from the plurality of materials and based on the match scores corresponding to the plurality of materials in the database, one or more target materials; and
    determine the target audio and video based on the one or more target materials.
  16. The system of claim 15, wherein to determine the target audio and video based on the at least one target attribute, the at least one processor causes the system to further:
    generate the target audio and video by adjusting basic attributes of the one or more target materials based on the one or more target materials and the at least one target attribute.
  17. The system of claim 13, wherein the at least one target attribute of the target audio and video includes one or more of a content attribute, a level of detail, a difficulty of understanding, a playback speed, a picture tone, or a timbre.
  18. The system of claim 13, wherein to determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video, the at least one processor causes the system to further:
    obtain at least one trained target attribute determination model;
    input at least a portion of the dialog feature information and the user feature information into the at least one trained target attribute determination model; and
    determine the at least one target attribute based on an output of the at least one trained target attribute determination model.
  19. The system of claim 13, wherein to determine the target audio and video based on the at least one target attribute, the at least one processor causes the system to further:
    obtain a trained material determination model;
    input the dialog feature information into the material determination model;
    determine an initial audio and video based on an output of the material determination model; and
    generate the target audio and video by adjusting basic attributes of the initial audio and video based on the at least one target attribute.
  20. The system of claim 13, wherein the target audio and video includes one or more segments.
  21. The system of claim 20, wherein the at least one processor causes the system to further:
    determine whether the user provides user feedback while the target audio and video is being played; and
    in response to the user providing user feedback while the target audio and video is being played,
    determine, based on the user feedback, whether basic attributes of at least one unplayed segment among the one or more segments of the target audio and video need to be adjusted.
  22. The system of claim 21, wherein the user feedback provided by the user while the target audio and video is being played includes one or more of: a pause count, a pause duration, a replay count, a replay duration, a fast-forward count, a fast-forward duration, a slow-play count, a slow-play duration, whether a new question is raised, and whether playback is ended early.
  23. The system of claim 13, wherein the user feature information includes user personal information, the user personal information including one or more of: age, gender, educational background, work background, and health status.
  24. The system of claim 13, wherein the user feature information includes user preference information, the user preference information including at least one of the user's preference settings, the user's current emotion, or historical user feedback provided by the user on historical audio and video.
  25. An apparatus for determining target audio and video, comprising:
    an obtaining module configured to obtain dialog information related to a user;
    a determination module configured to determine dialog feature information of the dialog information;
    an obtaining module configured to obtain user feature information of the user;
    a determination module configured to determine, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and
    a determination module configured to determine the target audio and video based on the at least one target attribute.
  26. A computer-readable storage medium storing computer instructions, wherein when a computer reads the computer instructions in the storage medium, the computer executes a method, the method comprising:
    obtaining dialog information related to a user;
    determining dialog feature information of the dialog information;
    obtaining user feature information of the user;
    determining, based on the dialog feature information and the user feature information, at least one target attribute corresponding to the target audio and video; and
    determining the target audio and video based on the at least one target attribute.
PCT/CN2020/141192 2020-12-30 2020-12-30 Method and system for determining target audio and video WO2022141142A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141192 WO2022141142A1 (zh) Method and system for determining target audio and video


Publications (1)

Publication Number Publication Date
WO2022141142A1 true WO2022141142A1 (zh) 2022-07-07

Family

ID=82259959

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141192 WO2022141142A1 (zh) Method and system for determining target audio and video

Country Status (1)

Country Link
WO (1) WO2022141142A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140112460A1 (en) * 2003-05-05 2014-04-24 Interactions Corporation Apparatus and Method for Processing Service Interactions
CN104598445A (zh) * 2013-11-01 2015-05-06 Tencent Technology (Shenzhen) Co., Ltd. Automatic question answering system and method
CN107958001A (zh) * 2016-10-14 2018-04-24 Alibaba Group Holding Limited Method and device for implementing intelligent question answering
CN110008308A (zh) * 2019-01-24 2019-07-12 Alibaba Group Holding Limited Method and apparatus for supplementing information for user questions
CN111586244A (zh) * 2020-05-20 2020-08-25 Shenzhen Konka Electronic Technology Co., Ltd. Voice customer service method and system


Similar Documents

Publication Publication Date Title
CN110869969B (zh) Virtual assistant for generating personalized responses within a communication session
KR102333505B1 (ko) Generating computer responses to social conversational inputs
US11430439B2 (en) System and method for providing assistance in a live conversation
CN105895087B (zh) Speech recognition method and device
US11705096B2 (en) Autonomous generation of melody
JP6876752B2 (ja) Response method and device
CN112074857A (zh) Combining machine learning and social data to generate personalized recommendations
US20150243279A1 (en) Systems and methods for recommending responses
US10803850B2 (en) Voice generation with predetermined emotion type
CN111428010B (zh) Method and device for human-machine intelligent question answering
US11928985B2 (en) Content pre-personalization using biometric data
CN111201567A (zh) Spoken, facial, and gesture communication devices and computing architecture for interacting with digital media content
CN112328849A (zh) User-profile construction method, and user-profile-based dialog method and device
CN113160819B (zh) Method, apparatus, device, medium, and product for outputting animation
CN112740132A (zh) Short-answer question score prediction
CN111883131B (zh) Speech data processing method and device
JP2023036574A (ja) Dialog recommendation method, model training method, apparatus, electronic device, storage medium, and computer program
WO2020213468A1 (ja) Information processing system, information processing method, and program
US20220050865A1 (en) Systems and methods for leveraging acoustic information of voice queries
CN113761156A (zh) Data processing method, apparatus, medium, and electronic device for human-machine interactive dialog
CN112910761B (zh) Instant messaging method, apparatus, device, storage medium, and program product
WO2022141142A1 (zh) Method and system for determining target audio and video
CN113392640B (zh) Title determination method, apparatus, device, and storage medium
CN113301352B (zh) Automatic chat during video playback
CN114115533A (zh) Intelligent interaction method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20967489

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967489

Country of ref document: EP

Kind code of ref document: A1