CN115086586A

CN115086586A - Method and device for generating double-recording video file

Info

Publication number: CN115086586A
Application number: CN202210639750.8A
Authority: CN
Inventors: 黄迎春; 李聪; 李尔玙; 唐晓东; 郑建; 羽翼
Original assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Current assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Priority date: 2022-06-08
Filing date: 2022-06-08
Publication date: 2022-09-20

Abstract

The invention discloses a method and a device for generating a double-recording video file, and relates to the technical field of video and audio processing. One embodiment of the method comprises: determining a current script node from a plurality of script nodes corresponding to a target product; recording a video file and an audio file of a target user for the current script node; determining, according to the audio file, a response result of the target user for the current script node; in response to the response result indicating that the target user has passed semantic recognition, continuing to record the video file and the audio file of the target user for the next script node after the current script node, until an interruption condition is met; and synthesizing the video file and the audio file into a double-recording video file of the target user for the target product, and sending the double-recording video file to a server. The embodiment can automatically switch between the script nodes involved in the business process, thereby relieving the working pressure of the relevant staff.

Description

Method and device for generating double-recording video file

Technical Field

The invention relates to the technical field of video and audio processing, and in particular to a method for generating a double-recording video file, a method for processing a double-recording video file, and corresponding devices.

Background

For some key services, such as insurance sales, securities account opening, and real-estate transactions, an audio and video recording (double-recording) file must be generated to preserve evidence of how the service was handled. The double-recording video file records audio and video data, such as the statements made by the personnel involved, during the business process. When generating a double-recording video file, the relevant personnel currently have to switch manually between the scripts involved in the business process. This places them under considerable working pressure, and manual switching increases the risk of misoperation.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method for generating a double-recording video file, a method for processing a double-recording video file, and corresponding devices, which can automatically switch between the script nodes involved in a business process, thereby reducing the working pressure of the relevant staff and effectively reducing the risk of misoperation.

In a first aspect, an embodiment of the present invention provides a method for generating a double-recording video file, including:

determining a current script node from a plurality of script nodes corresponding to a target product;

recording a video file and an audio file of a target user for the current script node;

determining, according to the audio file, a response result of the target user for the current script node;

in response to the response result indicating that the target user has passed semantic recognition, continuing to record the video file and the audio file of the target user for the next script node after the current script node, until an interruption condition is met;

and synthesizing the video file and the audio file into a double-recording video file of the target user for the target product, and sending the double-recording video file to a server.

Optionally, after determining the current script node, the method further includes:

acquiring the script content corresponding to the current script node;

and displaying the script content in a double-recording interface, and playing the script content.

Optionally, playing the script content includes:

receiving a play instruction issued by a user through the double-recording interface, wherein the play instruction comprises: a pause instruction, a resume-play instruction, and a replay instruction;

and playing the script content according to the play instruction.

Optionally, determining, according to the audio file, the response result of the target user for the current script node includes:

determining text information of the audio file corresponding to the current script node;

sending a semantic recognition request to a server, the semantic recognition request comprising: the text information and the node information of the current script node;

and receiving a response result returned by the server.

Optionally, after receiving the response result returned by the server, the method further includes:

displaying the text information and the response result in the double-recording interface;

and, in response to the response result indicating that semantic recognition was not passed, displaying a prompt to answer again.

Optionally, sending the double-recording video file to a server includes:

determining whether the double-recording video file meets a fragmentation condition;

in response to the double-recording video file meeting the fragmentation condition, splitting the double-recording video file into a plurality of fragment files;

and sending the plurality of fragment files to the server.

In a second aspect, an embodiment of the present invention provides a method for processing a double-recording video file, including:

receiving a double-recording video file sent by a terminal, wherein the double-recording video file records the business process of a target user for a target product;

and saving the double-recording video file.

Optionally, the method further comprises:

receiving a semantic recognition request, the semantic recognition request comprising: text information and node information of the current script node;

determining standard script information corresponding to the node information;

and determining a response result for the semantic recognition request according to the result of matching the text information against the standard script information, and returning the response result to the terminal.

In a third aspect, an embodiment of the present invention provides a device for generating a double-recording video file, including:

a node determining module, configured to determine a current script node from a plurality of script nodes corresponding to a target product;

a first recording module, configured to record a video file and an audio file of a target user for the current script node;

an identification module, configured to determine, according to the audio file, a response result of the target user for the current script node;

a second recording module, configured to, in response to the response result indicating that the target user has passed semantic recognition, continue to record the video file and the audio file of the target user for the next script node after the current script node, until an interruption condition is met;

and a synthesis module, configured to synthesize the video file and the audio file into a double-recording video file of the target user for the target product, and to send the double-recording video file to a server.

Optionally, the device further comprises:

a script playing module, configured to acquire the script content corresponding to the current script node;

Optionally, the script playing module is specifically configured to:

play the script content according to the play instruction.

In a fourth aspect, an embodiment of the present invention provides a processing apparatus for dual-recording video files, including:

the file receiving module is used for receiving a double-recording video file sent by a terminal, and the double-recording video file is used for recording a service processing process of a target user for a target product;

and the file storage module is used for storing the double-recording video file.

In a fifth aspect, an embodiment of the present invention provides an electronic device, including:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments described above.

In a sixth aspect, the present invention provides a computer-readable medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method of any one of the above embodiments.

In a seventh aspect, an embodiment of the present invention provides a computer program product, which includes a computer program, and when the program is executed by a processor, the method described in any of the above embodiments is implemented.

One embodiment of the above invention has the following advantages or benefits. A video file and an audio file are generated separately during the business process. The video file contains only the image data of each script node involved in the business process, and no audio data. Semantic recognition is performed on the audio file to switch automatically between the script nodes. Finally, the audio file and the video file are synthesized into a double-recording video file. The scheme of the embodiment of the invention can therefore switch automatically between the script nodes involved in the business process, reducing the working pressure of the relevant staff and effectively reducing the risk of misoperation.

In addition, on some brands of terminals running the Android system, only one recording input can exist at a time. If a video file that carries audio data and a separate audio file are recorded simultaneously, an audio input conflict may result. The scheme of the embodiment of the present invention instead generates a video file without audio data and a separate audio file. The audio file is used for semantic recognition to switch automatically between the script nodes, and the audio file and the video file can then be synthesized into the final double-recording video file. The scheme therefore has better compatibility and is suitable for terminals of multiple brands.

Further effects of the above non-conventional optional implementations are described below in connection with the specific embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

Fig. 1 is a schematic flowchart of a method for generating a double-recording video file according to a first embodiment of the present invention;

Fig. 2 is a schematic flowchart of a method for generating a double-recording video file according to a second embodiment of the present invention;

Fig. 3 is a schematic flowchart of a method for processing a double-recording video file according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a device for generating a double-recording video file according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a device for processing a double-recording video file according to an embodiment of the present invention;

Fig. 6 is a schematic block diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the technical solution of the present invention, the acquisition, storage, use, and processing of data comply with the relevant national laws and regulations.

Fig. 1 is a schematic flowchart of a method for generating a double-recording video file according to a first embodiment of the present invention. As shown in Fig. 1, the method includes:

Step 101: determining a current script node from a plurality of script nodes corresponding to the target product.

The scheme of the embodiment of the invention is applied to a terminal. The target product may be a financial product, an insurance product, or the like. The business process of the target product needs to be recorded and a double-recording file generated, so that behaviors such as the sale and purchase of the target product can be traced and accountability established, which reduces disputes to a certain extent.

The business process of the target product can be divided into a plurality of script nodes with an execution order. A script node may correspond to one business step, such as soliciting the user's opinion, stating identity, or explicitly alerting the user to matters requiring attention. Each script node can be associated with information such as a script title, a script prompt, script content, a question type, and standard script information.
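As an illustration of the ordered script nodes described above, the following is a minimal sketch of one possible in-memory representation. The class name, field names, and helper functions are hypothetical and are not taken from the embodiment; they only mirror the kinds of information listed (title, prompt, content, question type, standard script information).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScriptNode:
    """One step of the scripted business process (all field names are hypothetical)."""
    order: int                   # position in the execution sequence
    title: str                   # script title shown in the title list
    prompt: str                  # script prompt / recording instruction
    content: str                 # text displayed and read aloud to the user
    question_type: str           # e.g. "confirm" for a yes/no style question
    standard_answers: List[str]  # standard script information used for matching


def sort_nodes(nodes: List[ScriptNode]) -> List[ScriptNode]:
    """Return the nodes in execution order."""
    return sorted(nodes, key=lambda n: n.order)


def next_node(nodes: List[ScriptNode], current: ScriptNode) -> Optional[ScriptNode]:
    """Return the node that follows `current`, or None if it is the last one."""
    ordered = sort_nodes(nodes)
    idx = ordered.index(current)
    return ordered[idx + 1] if idx + 1 < len(ordered) else None
```

Keeping the execution order as an explicit field, rather than relying on list position, is one way to let the node list be stored or updated independently of how it is displayed.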

Step 102: recording a video file and an audio file of the target user for the current script node.

A video file and an audio file corresponding to the business process of the target product are recorded separately. The video file contains only the image data of each script node and no audio data.

Step 103: determining, according to the audio file, a response result of the target user for the current script node.

Voice data of the target user for the current script node is captured with a microphone or a similar tool and converted into text information for recognition. The terminal can determine the response result of the target user for the current script node according to the text information and the question type corresponding to the current script node.

The terminal can also send the text information, the question type, and other information corresponding to the current script node to the server, and determine the response result from the information returned by the server. Specifically: determining the text information of the audio file corresponding to the current script node; sending a semantic recognition request to the server, the semantic recognition request comprising the text information and the node information of the current script node; and receiving the response result returned by the server.

There are various methods of determining the response result. For example, the text information may be matched against the standard script information corresponding to the current script node. If the text information contains words or sentences that match the script node information, for example, the script node asks for the user's opinion and the text information is "agree", "confirm", or the like, the response result is judged to indicate that the target user has passed semantic recognition; if the text information is "disagree", "deny", or the like, the response result is judged to indicate that the target user has failed semantic recognition.

Alternatively, the text information, the question type, and other information corresponding to the current script node can be input into a semantic recognition model corresponding to the current script node, and the response result of the target user for the current script node obtained from the output of the semantic recognition model.
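The phrase-matching approach described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the substring matching rule, and the choice to check rejection phrases first are all hypothetical, and the embodiment does not prescribe a particular matching algorithm.

```python
from typing import Iterable


def judge_response(text: str, pass_phrases: Iterable[str], fail_phrases: Iterable[str]) -> bool:
    """Return True when the recognized text passes, False otherwise.

    Rejection phrases are checked first so that "I do not agree" is not
    accepted merely because it contains "agree".
    """
    normalized = text.strip().lower()
    if any(p.lower() in normalized for p in fail_phrases):
        return False
    return any(p.lower() in normalized for p in pass_phrases)
```

Text that matches neither list is treated as a failure here, which would lead to the user being asked to answer again; a real deployment might instead hand such cases to a semantic recognition model.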

Step 104: in response to the response result indicating that the target user has passed semantic recognition, continuing to record the video file and the audio file of the target user for the next script node after the current script node, until an interruption condition is met.

The node information and order information of the plurality of script nodes corresponding to the business process of the target product can be stored in the terminal in advance. When the response result indicates that the target user has passed semantic recognition, the node information of the next script node is acquired, and recording of the video file and the audio file continues for the next script node.

When the response result indicates that semantic recognition was not passed, recording of the video file and the audio file can be stopped. Alternatively, the text information and the response result can be displayed in the double-recording interface, and a prompt to answer again displayed in response to the failed semantic recognition. By clicking the prompt, the target user can restate the response to the current script, which helps the user keep the double-recording process running smoothly.

When an interruption condition occurs, the recording process ends. The interruption conditions may include: receiving an incoming call, completing recording for all script nodes corresponding to the target product, receiving an instruction to end recording, switching to another application, and the like.
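The automatic node switching and the interruption conditions can be pictured as a simple loop. The sketch below is an assumption-laden illustration: `run_session`, its two callbacks, and the reason strings are hypothetical names, and it implements only the "stop recording on failure" option rather than the answer-again prompt.

```python
from typing import Callable, List, Optional


def run_session(
    node_ids: List[str],
    answer_passes: Callable[[str], bool],
    interrupted: Callable[[], Optional[str]],
) -> dict:
    """Walk the script nodes in order and report why recording stopped.

    `answer_passes(node_id)` stands in for recording the answer and running
    semantic recognition; `interrupted()` returns a reason string such as
    "incoming_call", or None. Both callbacks are hypothetical.
    """
    completed: List[str] = []
    index = 0
    while index < len(node_ids):
        reason = interrupted()
        if reason is not None:
            return {"completed": completed, "stopped_by": reason}
        node_id = node_ids[index]
        if answer_passes(node_id):
            completed.append(node_id)
            index += 1  # automatic switch to the next script node
        else:
            # One of the options in the text: stop recording on failure.
            return {"completed": completed, "stopped_by": "recognition_failed"}
    return {"completed": completed, "stopped_by": "all_nodes_recorded"}
```

Finishing all nodes is itself listed as an interruption condition in the text, so the loop reports it with the same `stopped_by` field as an incoming call or an explicit end-recording instruction.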

Step 105: synthesizing the video file and the audio file into a double-recording video file of the target user for the target product, and sending the double-recording video file to the server.

The video file contains only the image data of each script node and no audio data. The audio file contains the audio data of each script node. The video file and the audio file can be synthesized into a double-recording video file of the target user for the target product according to the time information in the video file and the audio file. The double-recording video file comprises the video data and audio data of the target user during the business process for the target product. The double-recording video file is sent to the server, which archives and stores it.
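The embodiment does not name a tool for the synthesis step. Purely as an illustration, and assuming the `ffmpeg` command-line tool is available and that both recordings start at the same instant, the following sketch builds a command that combines a silent video with a separate audio track. The function name and file paths are hypothetical.

```python
from typing import List


def build_mux_command(video_path: str, audio_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg argument list that combines a silent video with an audio track."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", video_path,     # input 0: video without audio
        "-i", audio_path,     # input 1: audio only
        "-map", "0:v:0",      # take the video stream from input 0
        "-map", "1:a:0",      # take the audio stream from input 1
        "-c", "copy",         # copy streams without re-encoding
        "-shortest",          # stop at the end of the shorter input
        output_path,
    ]
```

The list can be passed to `subprocess.run(..., check=True)`. Stream copy only works when the output container supports both codecs; if the recordings do not start together, an offset derived from their time information would also be needed, which this sketch leaves out.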

In the embodiment of the invention, the video file and the audio file of the business process are generated separately. The video file contains only the image data of each script node involved in the business process, and no audio data. Semantic recognition is performed on the audio file to switch automatically between the script nodes. Finally, the audio file and the video file are synthesized into a double-recording video file. The scheme of the embodiment of the invention can therefore switch automatically between the script nodes involved in the business process, reducing the working pressure of the relevant staff and effectively reducing the risk of misoperation.

In addition, some brands of mobile phones running the Android system allow only one recording input at a time. If a video file that carries audio data and a separate audio file are recorded simultaneously, an audio input conflict may result. The scheme of the embodiment of the present invention instead generates a video file without audio data and a separate audio file. The audio file is used for semantic recognition to switch automatically between the script nodes, and the audio file and the video file can then be synthesized into the final double-recording video file. The scheme therefore has better compatibility and is suitable for terminals of multiple brands.

To address the problem that an overly large double-recording video file is inconvenient to transmit over a mobile network, the double-recording video file can be split into a plurality of fragment files for transmission. Specifically: determining whether the double-recording video file meets a fragmentation condition; in response to the double-recording video file meeting the fragmentation condition, splitting the double-recording video file into a plurality of fragment files; and sending the plurality of fragment files to the server.

The fragmentation condition can be set according to specific requirements. For example, the condition may be that the size of the double-recording video file exceeds a preset threshold, or that its duration exceeds a preset duration. When the double-recording video file meets the fragmentation condition, it is split into a plurality of fragment files, each no larger than a preset size, and the fragment files are sent to the server separately. This reduces transmission failures and relieves pressure on the network.

The terminal can send a fragment file to the server as follows. The terminal determines the fragment identifier and the md5 value of the fragment file, reads the binary data at the position in the double-recording video file where the fragment starts, and uploads the fragment file, the binary data, the fragment identifier, the md5 value, and so on to the server. The server receives the data and checks whether all fragment files have been uploaded successfully. If not, it returns information about the missing fragment files to the terminal. If so, it merges the fragment files according to the binary data corresponding to each fragment file, and converts the merged video file into a preset format for storage.

After the terminal sends a fragment file to the server, whether the fragment file was uploaded successfully can be determined as follows. The terminal determines the md5 value of the fragment file and uploads the md5 value to the server. The server queries, according to the md5 value, whether the fragment file has been uploaded, and returns the query result to the terminal. If the query result indicates that the fragment file has not been uploaded, the terminal sends the fragment file to the server again.
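The fragment upload and merge flow can be sketched in memory as follows. This is a minimal illustration, not the embodiment's protocol: the dictionary layout, the function names, and the use of a byte offset as the ordering key are assumptions, and network transfer and format conversion are omitted.

```python
import hashlib
from typing import Dict, List, Optional


def split_into_fragments(data: bytes, max_size: int) -> List[dict]:
    """Split `data` into fragments of at most `max_size` bytes.

    Each fragment carries its index, its byte offset in the original file,
    and the MD5 digest of its payload.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    fragments = []
    for index, offset in enumerate(range(0, len(data), max_size)):
        payload = data[offset:offset + max_size]
        fragments.append({
            "index": index,
            "offset": offset,
            "md5": hashlib.md5(payload).hexdigest(),
            "payload": payload,
        })
    return fragments


def missing_fragments(received: Dict[int, dict], total: int) -> List[int]:
    """Server side: list the fragment indexes that have not arrived intact."""
    missing = []
    for index in range(total):
        frag = received.get(index)
        if frag is None or hashlib.md5(frag["payload"]).hexdigest() != frag["md5"]:
            missing.append(index)
    return missing


def merge_fragments(received: Dict[int, dict], total: int) -> Optional[bytes]:
    """Server side: merge fragments by offset, or return None if any is missing."""
    if missing_fragments(received, total):
        return None
    ordered = sorted(received.values(), key=lambda f: f["offset"])
    return b"".join(f["payload"] for f in ordered)
```

The list returned by `missing_fragments` plays the role of the "missing fragment" information sent back to the terminal, which can then resend only those fragments.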

Fig. 2 is a schematic flowchart of a method for generating a double-recording video file according to a second embodiment of the present invention. As shown in Fig. 2, the method includes:

Step 201: determining a current script node from a plurality of script nodes corresponding to the target product.

Step 202: acquiring the script content corresponding to the current script node.

Step 203: displaying the script content in the double-recording interface and playing the script content.

A play instruction issued by the user through the double-recording interface is received, the play instruction comprising: a pause instruction, a resume-play instruction, and a replay instruction. The script content is played according to the play instruction.

When the script content is played, a male or female voice can be selected, the speech synthesis mode (online or offline) can be set, and so on. The double-recording interface provides buttons such as pause, resume, and replay, so that the user can pause, resume, or replay the broadcast while it is in progress.
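One way to picture how the play instructions affect the broadcast is a small state holder. The class below is a hypothetical sketch: the state names, the `position` field, and the rule that unsupported instructions are ignored are assumptions, and actual speech synthesis is not shown.

```python
class ScriptPlayback:
    """Tracks playback state for the script content of one node (hypothetical class)."""

    def __init__(self, content: str) -> None:
        self.content = content
        self.state = "stopped"   # one of: stopped, playing, paused
        self.position = 0        # index of the next character to speak

    def handle(self, instruction: str) -> str:
        """Apply a play instruction and return the resulting state."""
        if instruction == "play" and self.state == "stopped":
            self.state = "playing"
        elif instruction == "pause" and self.state == "playing":
            self.state = "paused"
        elif instruction == "resume" and self.state == "paused":
            self.state = "playing"
        elif instruction == "replay":
            self.position = 0
            self.state = "playing"
        # Any other combination is ignored and leaves the state unchanged.
        return self.state
```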

Step 204: recording a video file and an audio file of the target user for the current script node.

Step 205: determining, according to the audio file, a response result of the target user for the current script node.

Step 206: in response to the response result indicating that the target user has passed semantic recognition, continuing to record the video file and the audio file of the target user for the next script node after the current script node, until an interruption condition is met.

Step 207: synthesizing the video file and the audio file into a double-recording video file of the target user for the target product, and sending the double-recording video file to the server.

The embodiment of the invention provides a terminal-side video recording solution that supports script display, script broadcast, speech recognition, and semantic recognition during recording. The double-recording interface displays the script content, and the double-recording program also provides various operation buttons to help the user control the double-recording process and to improve the user experience.

In order to better implement the scheme of the embodiment of the present invention, the embodiment further provides a double-recording interface, in which the left part displays a list of script titles, the middle part displays information related to the script, and the right part displays operation buttons. The script-related information may include: recording instructions, script text content, recording duration, storage size, geographic location, etc. The operation buttons may include: end recording, previous, next, replay, etc. The content displayed in the double-recording interface is described below.

Script title: displays the title of each script. Titles cannot be clicked, and the current script node is highlighted. If there are too many nodes to fit, the list can be scrolled up and down.

Beside the script title list, a collapse/expand button controls whether the list is expanded.

Script content collapse/expand: collapses or expands the area showing the recording requirements and the voice-broadcast content.

Start recording: after clicking, video recording starts and the button changes to "End recording".

End recording: after clicking, a prompt box asks the user to confirm ending the current recording. If the user chooses to cancel, recording continues; if the user confirms, recording ends, the recorded video file is processed, and the recording result is returned to the upper-layer business logic.

Previous: switches to the previous script node; if the current node is the first script node, the button is disabled.

Next: switches to the next script node; if the current node is the last script node, the button is disabled.

Play: after clicking, the script content is read aloud and the button changes to "Pause".

Pause: after clicking, the script broadcast is paused.

Replay: after clicking, the content of the current script node is read aloud again.
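The availability rule for the "Previous" and "Next" buttons can be written down directly. The helper below is a hypothetical sketch; the function name and the returned keys are not part of the embodiment.

```python
def navigation_state(current_index: int, node_count: int) -> dict:
    """Return which navigation buttons are enabled for the current script node."""
    if not 0 <= current_index < node_count:
        raise ValueError("current_index out of range")
    return {
        "previous_enabled": current_index > 0,            # disabled on the first node
        "next_enabled": current_index < node_count - 1,   # disabled on the last node
    }
```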

In addition, to make it easier for the user to state a response to the current script, buttons such as "Start answer", "End answer", and "Answer again" can be provided in the double-recording interface.

Start answer: after the user clicks, the system records the audio data of the user's answer, and the button changes to "End answer".

End answer: after the user clicks, the system stops recording, converts the user's audio data into text information, uploads the text information to the server for semantic recognition, and receives a flag returned by the server indicating whether recognition was passed.

Answer again: after clicking this button, the user can answer again, and the button is displayed as "End answer". After semantic recognition of the text information completes, the button is displayed as "Answer again" once more.

Fig. 3 is a schematic flowchart of a method for processing a double-recording video file according to a third embodiment of the present invention. As shown in Fig. 3, the method includes:

Step 301: receiving a double-recording video file sent by the terminal, wherein the double-recording video file records the business process of a target user for a target product.

Step 302: saving the double-recording video file.

When the double-recording video file is too large, the terminal can split it into a plurality of fragment files for transmission. Specifically, the terminal determines the fragment identifier and the md5 value of the fragment file, reads the binary data at the position in the double-recording video file where the fragment starts, and uploads the fragment file, the binary data, the fragment identifier, the md5 value, and so on to the server. The server receives the data and checks whether all fragment files have been uploaded successfully. If not, it returns information about the missing fragment files to the terminal. If so, it merges the fragment files according to the binary data corresponding to each fragment file, and converts the merged video file into a preset format for storage.

The scheme of this embodiment is applied on the server side. The server receives the double-recording video file sent by the terminal and archives and stores it. Because the double-recording video file records the business process of the target user for the target product, the business process can be traced and accountability established.

In one embodiment of the invention, the method further comprises: receiving a semantic recognition request, the semantic recognition request comprising text information and node information of the current script node; determining standard script information corresponding to the node information; and determining a response result for the semantic recognition request according to the result of matching the text information against the standard script information, and returning the response result to the terminal.

There are various methods of determining the response result. For example, the text information may be matched against the standard script information corresponding to the current script node. If the text information contains words or sentences that match the script node information, for example, the script node asks for the user's opinion and the text information is "agree", "confirm", or the like, the response result is judged to indicate that the target user has passed semantic recognition.

Note that the node information and order information of the plurality of script nodes corresponding to the target product need to be stored in the terminal in advance. When the script nodes corresponding to the target product change, the server can send an upgrade prompt or an upgrade package to the terminal, so that the terminal can update its script node information promptly and the adverse effects of a delayed terminal upgrade are reduced.

Fig. 4 is a schematic structural diagram of an apparatus for generating a dual-record video file according to an embodiment of the present invention, as shown in fig. 4, the apparatus includes:

a node determining module 401, configured to determine a current script node from a plurality of script nodes corresponding to a target product;

a first recording module 402, configured to record a video file and an audio file of a target user for the current script node;

an identifying module 403, configured to determine, according to the audio file, a response result of the target user for the current script node;

a second recording module 404, configured to, in response to the response result indicating that the target user passes semantic recognition, continue recording the video file and the audio file of the target user for the next script node after the current script node, until an interruption condition is met;

and a synthesizing module 405, configured to synthesize the video file and the audio file into a double-recording video file of the target user for the target product, and send the double-recording video file to a server.
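The cooperation of modules 401 to 404 can be sketched as a simple loop over the nodes. This is an illustrative sketch only: the function `run_session`, its callback parameters, and the choice of "too many failed answers" as the interruption condition (`max_attempts`) are assumptions, since the interruption condition is not specified in this passage.

```python
from typing import Callable, List, Sequence, Tuple

Clip = Tuple[str, str, str]  # (node id, video clip reference, audio clip reference)


def run_session(
    nodes: Sequence[str],
    record: Callable[[str], Tuple[str, str]],
    recognize: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> Tuple[List[Clip], bool]:
    """Record each node in order; advance only when the answer passes recognition.

    Returns the recorded clips and whether every node was completed.
    """
    clips: List[Clip] = []
    for node in nodes:
        for _ in range(max_attempts):
            video, audio = record(node)        # first/second recording module
            clips.append((node, video, audio))
            if recognize(node, audio):         # identifying module
                break                          # automatic switch to the next node
        else:
            # Assumed interruption condition: the user never passed this node.
            return clips, False
    return clips, True
```

The clips returned by such a loop would then be handed to the synthesizing step, which merges video and audio into the final file.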

Optionally, the apparatus further comprises:

a script playing module 406, configured to obtain script content corresponding to the current script node;

Optionally, the script playing module 406 is specifically configured to:

and playing the script content according to the playing instruction.
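Handling of the playing instruction can be pictured as a small state machine. The sketch below covers the three instruction types named in claim 3 (pause, resume, replay); the class `ScriptPlayer`, the string instruction names, and the character-based playback position are hypothetical simplifications for illustration.

```python
from enum import Enum


class PlayState(Enum):
    PLAYING = "playing"
    PAUSED = "paused"


class ScriptPlayer:
    """Hypothetical player that tracks playback position within the script content."""

    def __init__(self, content: str) -> None:
        self.content = content
        self.position = 0
        self.state = PlayState.PLAYING

    def advance(self, amount: int) -> None:
        """Move playback forward; has no effect while paused."""
        if self.state is PlayState.PLAYING:
            self.position = min(len(self.content), self.position + amount)

    def handle(self, instruction: str) -> None:
        """Apply a pause, resume, or replay instruction."""
        if instruction == "pause":
            self.state = PlayState.PAUSED
        elif instruction == "resume":
            self.state = PlayState.PLAYING
        elif instruction == "replay":
            self.position = 0
            self.state = PlayState.PLAYING
        else:
            raise ValueError(f"unknown play instruction: {instruction}")
```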

Optionally, the identification module 403 is specifically configured to:

and receiving a response result returned by the server.

Optionally, the apparatus further comprises:

a prompt module 407, configured to display the text information and the response result in a double-recording interface;

and to display prompt information asking the target user to answer again, in response to the response result indicating that the target user fails semantic recognition.

Optionally, the synthesis module 405 is specifically configured to:

determining whether the double-recording video file meets a slicing condition;

and sending the plurality of fragmented files to a server.
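A slicing step of this kind might look like the sketch below. The slicing condition used here (the file is larger than a size threshold) is an assumption, as the condition is not spelled out in this passage; `slice_file` and `max_chunk` are hypothetical names.

```python
from typing import List


def slice_file(data: bytes, max_chunk: int) -> List[bytes]:
    """Split data into fragments of at most max_chunk bytes.

    Assumed slicing condition: the file is larger than max_chunk.
    A file that does not meet the condition is returned as a single fragment.
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    if len(data) <= max_chunk:
        return [data]
    return [data[i:i + max_chunk] for i in range(0, len(data), max_chunk)]
```

Concatenating the fragments in order reproduces the original bytes, so the server can reassemble the file once all fragments have arrived.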

Fig. 5 is a schematic structural diagram of a device for processing a double-recording video file according to an embodiment of the present invention. As shown in Fig. 5, the device includes:

a file receiving module 501, configured to receive a double-recording video file sent by a terminal, where the double-recording video file is used to record the business processing process of a target user for a target product;

a file saving module 502, configured to save the double-recording video file.

Optionally, the device further comprises:

a semantic recognition module 503, configured to receive a semantic recognition request, where the semantic recognition request includes: text information and node information of the current script node;

An embodiment of the present invention provides an electronic device, including:

one or more processors;

a storage device for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method of any of the embodiments described above.

Embodiments of the present invention provide a computer program product, which includes a computer program; when the computer program is executed by a processor, it implements the method of any of the embodiments described above.

Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in Fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read from it can be installed into the storage section 608 as necessary.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described, for example, as: a processor comprising a node determining module, a first recording module, an identification module, a second recording module and a synthesis module. In some cases the names of the modules do not limit the modules themselves; for example, the node determining module may also be described as a module for determining a current script node from a plurality of script nodes corresponding to the target product.

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform:

recording a video file and an audio file of a target user for the current script node;

According to the technical scheme of the embodiment of the invention, the video file and the audio file of the business process are generated separately. The video file contains only the image data of each script node involved in the business process and does not contain the audio data of those nodes. Semantic recognition is performed on the audio file to switch automatically between script nodes. Finally, the audio file and the video file are synthesized into a double-recording video file. The scheme of the embodiment of the invention can therefore switch automatically between the script nodes involved in the business process, which reduces the working pressure of the related workers and effectively reduces the risk of misoperation.
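The final synthesis of a separately recorded video file and audio file can be illustrated with a command-line muxing tool. The sketch below builds an FFmpeg command; using FFmpeg at all, the AAC audio codec, and the file names are assumptions for illustration, as the embodiment does not name a tool or container format.

```python
from typing import List


def build_mux_command(video_path: str, audio_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg argument list that merges one video stream with one audio stream."""
    return [
        "ffmpeg",
        "-i", video_path,      # input 0: image-only video file
        "-i", audio_path,      # input 1: audio file
        "-map", "0:v:0",       # take the first video stream from input 0
        "-map", "1:a:0",       # take the first audio stream from input 1
        "-c:v", "copy",        # keep the video stream without re-encoding
        "-c:a", "aac",         # encode the audio as AAC (assumed codec choice)
        "-shortest",           # stop at the end of the shorter input
        output_path,
    ]
```

On a machine where FFmpeg is installed, the returned list could be passed to `subprocess.run(cmd, check=True)` to produce the merged file.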

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for generating a double-recording video file is applied to a terminal and comprises the following steps:

in response to the response result indicating that the target user passes semantic recognition, continuing to record the video file and the audio file of the target user for the next script node after the current script node, until an interruption condition is met;

2. The method of claim 1, wherein after determining the current script node, further comprising:

3. The method of claim 2, wherein the playing the script content comprises:

receiving a playing instruction issued by a user on the double-recording interface, wherein the playing instruction comprises: a pause instruction, a resume instruction and a replay instruction;

and playing the script content according to the playing instruction.

4. The method of claim 1, wherein determining the response result of the target user for the current script node according to the audio file comprises:

and receiving a response result returned by the server.

5. The method according to claim 4, wherein after receiving the response result returned by the server, further comprising:

and displaying prompt information asking the target user to answer again, in response to the response result indicating that the target user fails semantic recognition.

6. The method of claim 1, wherein sending the dual-record video file to a server comprises:

determining whether the double-recording video file meets a slicing condition;

and sending the plurality of fragmented files to a server.

7. A processing method of a double-recording video file is applied to a server side and comprises the following steps:

and saving the double-recording video file.

8. The method of claim 7, further comprising:

9. A generation device of a double-recording video file is applied to a terminal and comprises the following components:

10. The apparatus of claim 9, further comprising:

a script playing module, configured to obtain script content corresponding to the current script node;

11. The apparatus of claim 10, wherein the script playing module is specifically configured to:

and playing the script content according to the playing instruction.

12. A processing device for double-recording video files is applied to a server side and comprises:

13. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.

14. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-8.

15. A computer program product comprising a computer program, characterized in that the computer program realizes the method according to any of claims 1-8 when executed by a processor.