CN112131365A - Data processing method, device, equipment and medium - Google Patents

Data processing method, device, equipment and medium

Info

Publication number
CN112131365A
CN112131365A
Authority
CN
China
Prior art keywords
data
video
voice
client
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011006333.7A
Other languages
Chinese (zh)
Inventor
王锁平
周登宇
乔磊
曹传兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011006333.7A priority Critical patent/CN112131365A/en
Priority to PCT/CN2020/124256 priority patent/WO2021159734A1/en
Publication of CN112131365A publication Critical patent/CN112131365A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3343Query execution using phonetics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application discloses a data processing method, device, equipment, and medium, which relate to speech processing technology in artificial intelligence and can be applied to a blockchain network. The method comprises the following steps: receiving a service handling request sent by a client; obtaining first semantic information from first voice data, and determining, according to the first semantic information, second voice data matching it; determining, from a video database, second video data matching the second voice data; and synthesizing the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to second multimedia data returned by the client, wherein the second multimedia data is the client's reply to the synthesized data. The embodiment of the application can improve the accuracy of speech recognition and improve user experience.

Description

Data processing method, device, equipment and medium
Technical Field
The present application relates to speech processing technology in artificial intelligence, and in particular to a data processing method, apparatus, device, and medium.
Background
Traditionally, services are generally handled manually. Manual handling cannot provide around-the-clock service and requires high labor cost. Handling services through video robot calls has therefore gradually replaced manual handling, enabling services to be handled anytime and anywhere.
The dialogue of existing video robots is generally fixed: the robot can only recognize fixed scripts and give the corresponding replies. As a result, the recognized semantic information is not comprehensive, speech recognition accuracy is low, the accuracy of the video robot's dialogue is low, and user experience is poor.
Disclosure of Invention
The embodiment of the application provides a data processing method, device, equipment, and medium, which can improve the accuracy of speech recognition, thereby improving the accuracy of video robot dialogue and improving user experience.
An embodiment of the present application provides a data processing method, including:
receiving a service handling request sent by a client, wherein the service handling request comprises first voice data and first video data;
obtaining first semantic information from the first voice data, and determining, according to the first semantic information, second voice data matching the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information is used to reflect at least two keywords in first text data corresponding to the first voice data and the association relation between the at least two keywords;
determining second video data matched with the second voice data from a video database, wherein the video database comprises a plurality of video data;
and synthesizing the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data returned by the client, wherein the first multimedia data is the client's reply to the synthesized data.
Optionally, the obtaining of first semantic information from the first voice data includes: converting the first voice data to obtain first text data corresponding to the first voice data; extracting keywords from the first text data to obtain at least two keywords; obtaining word sense information of the at least two keywords; obtaining a word combination from the at least two keywords, and obtaining combined word sense information of the word combination; determining the association relation between the at least two keywords according to the word sense information and the combined word sense information; and determining, as the first semantic information, semantic information corresponding to the first text data according to the at least two keywords and the association relation between them.
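The keyword-and-association pipeline above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the sense lexicons, keyword list, and sample sentence are invented for the example, and a real system would use a proper Chinese tokenizer and a trained semantic model rather than whitespace splitting and dictionary lookups.

```python
from itertools import combinations

# Hypothetical sense lexicons; all entries are illustrative only.
WORD_SENSES = {
    "credit": "finance",
    "card": "payment-instrument",
    "balance": "amount",
}
COMBINED_SENSES = {
    ("credit", "card"): "payment-product",  # the pair carries its own sense
    ("card", "balance"): "account-query",
}

def extract_semantic_info(text, keywords):
    """Sketch of the claimed steps: keep the keywords found in the text,
    look up per-word senses, form pairwise word combinations, look up
    combined senses, and record an association relation for each pair
    that has a combined sense of its own."""
    found = [w for w in keywords if w in text.split()]
    senses = {w: WORD_SENSES.get(w) for w in found}
    relations = {}
    for pair in combinations(found, 2):
        combined = COMBINED_SENSES.get(pair)
        if combined is not None:
            relations[pair] = combined
    return {"keywords": found, "senses": senses, "relations": relations}

info = extract_semantic_info("check my credit card balance",
                             ["credit", "card", "balance"])
```

Here the association relations, not just the individual keywords, feed into the semantic information, which is the distinction the patent draws against keyword-only recognition.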
Optionally, the determining, according to the first semantic information, of second voice data matching the first semantic information includes: obtaining similarities between the at least two keywords and a plurality of text data in a corpus to obtain a plurality of first similarities; determining text data corresponding to the association relation between the at least two keywords, and obtaining similarities between that text data and the plurality of text data to obtain a plurality of second similarities; determining the text data with the maximum similarity among the plurality of first similarities as first target text data, and the text data with the maximum similarity among the plurality of second similarities as second target text data; determining second text data according to the first target text data and the second target text data; and converting the second text data to obtain, as the second voice data, voice data corresponding to the second text data.
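The maximum-similarity selection above can be illustrated with a simple bag-of-words cosine similarity. The patent does not specify a similarity measure, so the metric, the corpus entries, and the query terms below are all assumptions made for the sketch.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def best_match(query_terms, corpus):
    """Return the corpus entry most similar to the query terms,
    a stand-in for the patent's maximum-similarity selection."""
    q = Counter(query_terms)
    scored = [(cosine(q, Counter(doc.split())), doc) for doc in corpus]
    return max(scored, key=lambda s: s[0])[1]

corpus = [                       # illustrative corpus entries
    "open a new bank account",
    "check credit card balance",
    "report a lost card",
]
reply_source = best_match(["credit", "card", "balance"], corpus)
```

In the full method this matching would run twice, once over the keywords and once over the text derived from the association relation, and the two target texts would then be merged into the second text data before text-to-speech conversion.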
Optionally, the determining of second video data matching the second voice data from the video database includes: obtaining a semantic scene of the second text data and an application scene of each video data in the video database; determining, from the application scenes, a target application scene matching the semantic scene; and determining the video data corresponding to the target application scene as the second video data.
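A minimal sketch of this scene-matching step, assuming scenes are plain string labels and the video database maps one clip per application scene. Both assumptions, and the file names, are illustrative; the patent leaves the scene encoding unspecified.

```python
# Illustrative scene labels and clip names, not from the patent.
VIDEO_DB = {
    "talking": "robot_talking.mp4",
    "apology": "robot_apology.mp4",
    "greeting": "robot_greeting.mp4",
}

def match_video(semantic_scene, video_db):
    """Pick the video whose application scene equals the reply's semantic
    scene, falling back to a neutral talking clip when nothing matches."""
    return video_db.get(semantic_scene, video_db["talking"])

clip = match_video("apology", VIDEO_DB)
```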
Optionally, the synthesizing of the second video data and the second voice data to obtain synthesized data includes: obtaining the voice duration of the second voice data; intercepting, from the second video data, video data of the same duration as the voice duration as candidate video data; and synthesizing the second voice data and the candidate video data to obtain the synthesized data.
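This duration-matched synthesis could in practice be realized by muxing with a tool such as FFmpeg. As a hedged sketch, the helper below only builds the command line; the paths are placeholders, and this is one possible realization rather than the patented implementation.

```python
def build_ffmpeg_cmd(video_path, audio_path, out_path, voice_duration_s):
    """Sketch of muxing a silent video clip with a TTS voice track,
    trimming the output to the voice duration via `-t`. The command is
    constructed but not executed here."""
    return [
        "ffmpeg", "-i", video_path, "-i", audio_path,
        "-t", str(voice_duration_s),       # cut output to the voice length
        "-map", "0:v:0", "-map", "1:a:0",  # video from the clip, audio from TTS
        "-c:v", "copy", "-c:a", "aac",     # copy video stream, encode audio
        out_path,
    ]

cmd = build_ffmpeg_cmd("robot_talking.mp4", "reply.wav", "reply.mp4", 12.5)
```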
Optionally, the method further includes: intercepting, from the first video data, a first image of the user corresponding to the client; verifying the validity of the client according to the first image; and, if the client is valid, performing the step of carrying out service processing according to the first multimedia data returned by the client.
Optionally, the method further includes: if the client is not valid, sending the client adjustment information instructing the user to adjust posture; obtaining second multimedia data sent by the client in response to the adjustment information, the second multimedia data including third video data; intercepting a second image of the user from the third video data; and verifying the validity of the client according to the second image.
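The verify-then-request-adjustment loop described in the last two paragraphs can be sketched as follows. `is_valid_face` stands in for whatever face-verification model a real deployment would use (the patent does not name one), and the two-attempt limit is an illustrative assumption.

```python
def verify_client(frames, is_valid_face, max_attempts=2):
    """Check successive frames of the user: accept on the first frame that
    passes verification; after a failed check, the server would send
    posture-adjustment instructions and try a frame from the new video."""
    attempts = 0
    for frame in frames[:max_attempts]:
        attempts += 1
        if is_valid_face(frame):
            return {"valid": True, "attempts": attempts}
        # posture-adjustment instructions would be sent to the client here
    return {"valid": False, "attempts": attempts}

result = verify_client(["frame1", "frame2"], lambda f: f == "frame2")
```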
An embodiment of the present application provides a data processing apparatus, including:
the data acquisition module is used for receiving a service handling request sent by a client, wherein the service handling request comprises first voice data and first video data;
the voice matching module is used to obtain first semantic information from the first voice data and determine, according to the first semantic information, second voice data matching the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data and the association relation between the at least two keywords;
the video matching module is used for determining second video data matched with the second voice data from a video database, and the video database comprises a plurality of video data;
and the service processing module is used to synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to first multimedia data returned by the client, wherein the first multimedia data is the client's reply to the synthesized data.
Optionally, the voice matching module is specifically configured to: convert the first voice data to obtain first text data corresponding to the first voice data; extract keywords from the first text data to obtain at least two keywords; obtain word sense information of the at least two keywords; obtain a word combination from the at least two keywords, and obtain combined word sense information of the word combination; determine the association relation between the at least two keywords according to the word sense information and the combined word sense information; and determine, as the first semantic information, semantic information corresponding to the first text data according to the at least two keywords and the association relation between them.
Optionally, the voice matching module is specifically configured to: obtain similarities between the at least two keywords and a plurality of text data in a corpus to obtain a plurality of first similarities; determine text data corresponding to the association relation between the at least two keywords, and obtain similarities between that text data and the plurality of text data to obtain a plurality of second similarities; determine the text data with the maximum similarity among the plurality of first similarities as first target text data, and the text data with the maximum similarity among the plurality of second similarities as second target text data; determine second text data according to the first target text data and the second target text data; and convert the second text data to obtain, as the second voice data, voice data corresponding to the second text data.
Optionally, the video matching module is specifically configured to: obtain a semantic scene of the second text data and an application scene of each video data in the video database; determine, from the application scenes, a target application scene matching the semantic scene; and determine the video data corresponding to the target application scene as the second video data.
Optionally, the service processing module is specifically configured to: obtain the voice duration of the second voice data; intercept, from the second video data, video data of the same duration as the voice duration as candidate video data; and synthesize the second voice data and the candidate video data to obtain the synthesized data.
Optionally, the apparatus further includes a validity verification module, configured to: intercept, from the first video data, a first image of the user corresponding to the client; verify the validity of the client according to the first image; and, if the client is valid, perform the step of carrying out service processing according to the first multimedia data returned by the client.
Optionally, the apparatus further includes an information adjustment module, configured to: if the client is not valid, send the client adjustment information instructing the user to adjust posture; obtain second multimedia data sent by the client in response to the adjustment information, the second multimedia data including third video data; intercept a second image of the user from the third video data; and verify the validity of the client according to the second image.
One aspect of the present application provides a computer device, comprising: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is used to provide a data communication function, the memory is used to store a computer program, and the processor is used to call the computer program to execute the method of the above aspect of the embodiment of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program including program instructions, which, when executed by a processor, cause the processor to execute the data processing method of the first aspect.
In the embodiment of the application, a service handling request sent by a client is received, the request including first voice data and first video data; first semantic information is obtained from the first voice data, and second voice data matching the first semantic information is determined according to it. Because the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relation between the at least two keywords, it represents the meaning of the first voice data more accurately, that is, speech recognition accuracy is higher, so the matched second voice data is more accurate, which in turn improves the accuracy of the video robot's dialogue. Second video data matching the second voice data is then determined from the video database, the second video data and the second voice data are synthesized to obtain synthesized data, the synthesized data is sent to the client, and service processing is performed according to the first multimedia data returned by the client. Processing and replying to the user's voice data in real time improves the efficiency of service processing. By synthesizing the second voice data with the second video data before sending them to the client, the user sees the video robot holding a video conversation with him or her, making human-computer interaction more natural and engaging and improving user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
fig. 1 is a schematic structural diagram of an information processing system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Key technologies of speech processing (Speech Technology) include automatic speech recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of its most promising modes.
The technical solution of the application is suitable for remote service handling scenarios such as remote face-to-face review, video return visits, and remote account opening. Speech processing technology is used to obtain first semantic information from first voice data, determine second voice data matching the first semantic information, determine second video data matching the second voice data from a video database, synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data returned by the client. When the first semantic information is obtained from the first voice data, it is obtained by combining the keywords in the text data corresponding to the first voice data and the relations among those keywords, so that it reflects the meaning of the first voice data more accurately, that is, speech recognition accuracy is higher, and the second voice data replies to the first voice data more accurately. The application can be applied to fields such as smart government and smart education, and is conducive to the construction of smart cities.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an information processing system according to an embodiment of the present application. The system includes a client 11 and a service processing server 12 corresponding to a service processing platform. The client 11 may be a terminal that sends a service handling request; the service processing server 12 may be a back-end service device that performs information processing, which may include obtaining first semantic information from the first voice data, determining second voice data matching the first semantic information, determining second video data matching the second voice data from a video database, synthesizing the second video data and the second voice data to obtain synthesized data, and so on.
The service processing server 12 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. When the service processing server 12 is an independent physical server, it can perform the information processing on its own; when it consists of a plurality of physical servers, those servers can cooperate. For example, one server may obtain the first semantic information from the first voice data, another may determine the second voice data matching the first semantic information, and another may synthesize the second video data and the second voice data to obtain the synthesized data. The client 11 may be a computer device, including a mobile phone, tablet computer, notebook computer, palmtop computer, smart speaker, mobile internet device (MID), point-of-sale (POS) machine, wearable device (e.g., a smart watch or smart bracelet), and the like. There may be one or more clients 11; the embodiment of the present application is illustrated with one client, and a plurality of clients may be processed in the same manner. It should be noted that the service processing may also be executed by a client, in the same manner as described for the service processing server; the description below takes the service processing server as an example.
In practical applications, suppose a user needs to handle a service. First, the user may send a service handling request to the service processing server through the client; the request may include first voice data and first video data. The service processing server receives the request and obtains first semantic information from the first voice data. The first semantic information may include a service identifier, such as a service name or service code, and may reflect at least two keywords in the first text data corresponding to the first voice data and the association between those keywords; that is, the first semantic information is determined based on the at least two keywords and the association relation between them. The service processing server then determines second video data matching the second voice data from the video database. For example, if the second voice data describes the procedure for handling a certain service, the second video data may be silent video of the video robot simulating human speech, such as the mouth opening and closing, the eyes moving, and the face smiling. Alternatively, if the second voice data indicates that the current network connection has failed, the second video data may be silent video of the video robot simulating human frustration or disappointment. The service processing server synthesizes the silent second video data with the second voice data to obtain synthesized data and sends it to the client, so that the user can see the video robot talking with him or her through the client.
Therefore, by synthesizing video data and voice data, a real-time video call with the user of the client can be realized, improving user experience. When the first semantic information is obtained from the first voice data, each keyword in the corresponding text data is extracted together with the association relations among the keywords, so the meaning of the first voice data can be determined more accurately, that is, speech recognition accuracy is higher, the second voice data replies more accurately, and the accuracy and flexibility of the video robot's dialogue are improved.
Referring to fig. 2, fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 2, the method includes:
s101, receiving a service transaction request sent by a client.
The service handling request includes first voice data and first video data. The first voice data is obtained by the client recording the user's voice, and the first video data is obtained by the client recording the user's posture, expression, actions, and the like. The client may be the terminal used by the user for service processing. The first voice data may include a service identifier, which may be a service name, an abbreviation of the service name, a service code, or any other identifier uniquely indicating the service. A service is the type of transaction the user needs to handle, such as purchasing insurance, a bank loan, a bank card transaction, or a credit card transaction. Alternatively, a service may be something the user requires, such as a bank card balance inquiry or a credit line inquiry.
Optionally, the user may send a call request through the client; the service processing server obtains the call request, establishes a call connection with the client according to it, and receives the service handling request sent by the client over that connection.
Here, the service processing server may correspond to a plurality of video extensions with identical functions. If the call request includes a video extension number, the extension corresponding to that number is allocated to the client, establishing the call connection. If the call request contains no video extension number, an idle video extension can be allocated to the client. Specifically, extensions may be matched to the client's call request according to the waiting time of each idle extension; for example, the idle extension that has been waiting the longest, or the one that has been waiting the shortest, may be matched to the client. Allocating idle video extensions to clients improves the efficiency of establishing call connections and, consequently, of subsequent service processing. The call connection includes a video connection, a voice connection, and the like: the video connection is used to obtain video data sent by the client, and the voice connection to obtain voice data. If the call connection is only a voice connection, the client may send the first video data to the data processing server in another manner, for example by video transmission.
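The extension-allocation rule above can be sketched as follows: honor a requested extension number if present, otherwise pick the idle extension that has been waiting the longest. The mapping of extension numbers to idle seconds, with `None` marking a busy extension, is an assumed data structure for illustration, not something the patent specifies.

```python
def allocate_extension(requested, extensions):
    """Allocate a video extension for an incoming call request.
    `extensions` maps extension number -> idle seconds (None = busy)."""
    if requested is not None and requested in extensions:
        return requested                 # call request named an extension
    idle = {n: w for n, w in extensions.items() if w is not None}
    if not idle:
        return None                      # no idle extension available
    return max(idle, key=idle.get)       # longest-waiting idle extension

ext = allocate_extension(None, {"801": 120, "802": 45, "803": None})
```

The shortest-wait variant the text also mentions would simply replace `max` with `min`.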
In a specific implementation, when the client establishes a call connection with the service processing server, the client may send a call request to the service processing server. The call request may be verified through network switches, a firewall, and the like, for example to check whether it is safe, whether it carries a virus, and whether it is in a format recognizable by the service processing server; the call connection with the service processing server is established after the verification passes.
Optionally, after the call connection is established between the client and the service processing server, the service processing server may send pre-stored voice data welcoming the user to the client; the user may determine from this voice data that the call connection was established successfully, and then send the service handling request. For example, the pre-stored welcome voice data may be "Hello, welcome to call in; how may I serve you?". Optionally, to save storage space, text data welcoming the user may instead be stored in the service processing server, which converts the text data into voice data before sending it to the client.
S102, first semantic information is obtained from the first voice data, and second voice data matched with the first semantic information is determined according to the first semantic information.
The second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relation between those keywords. Compared with determining the meaning of the first voice data from keywords alone, combining the keywords with the association relation between them determines the meaning of the first voice data more accurately. The first semantic information determined in this way is therefore more accurate, that is, the speech recognition accuracy is higher, so the matched second voice data is also more accurate.
Optionally, one method for obtaining the first semantic information from the first voice data may be: converting the first voice data to obtain the first text data corresponding to it; extracting keywords from the first text data to obtain at least two keywords; acquiring word sense information of the at least two keywords; obtaining a word combination from the at least two keywords and acquiring the combined word sense information of that combination; determining the association relation between the at least two keywords according to the word sense information and the combined word sense information; and determining semantic information corresponding to the first text data, according to the at least two keywords and the association relation between them, as the first semantic information.
Here, since the first voice data is speech, it can be converted into text to obtain the first text data, and at least two keywords are obtained by keyword extraction on that text. Optionally, the service processing server may perform word segmentation on the first text data, dividing it into at least one segment; acquire a stop word set comprising at least one word irrelevant to the service; search the stop word set for target words matching the segments; delete the target words from the segments; and extract keywords from the segments that remain after deletion, obtaining at least two keywords.
For example, the first text data is "I want to purchase vehicle insurance but need a loan first due to insufficient funds". Word segmentation divides this into 10 segments, which are then matched against each stop word in the stop word set. If the 4 segments "I", "want", "but", and "need" are matched, they are deleted, leaving "purchase vehicle insurance funds insufficient first loan"; keyword extraction on this yields the 6 keywords "purchase", "vehicle insurance", "funds", "insufficient", "first", and "loan". In a specific implementation, ASR or another technology may be used to convert the voice data into text data so that keywords can be extracted from it.
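The stop-word filtering step above can be sketched in Python. The English tokens, the segmenter output, and the stop-word set are illustrative stand-ins; a real implementation would use a Chinese word segmenter and a proper keyword extractor (for example TF-IDF based):

```python
def extract_keywords(segments, stop_words):
    """Drop every segment that appears in the stop-word set; the
    remaining segments are the candidate keywords (a simplified
    stand-in for real keyword extraction)."""
    return [seg for seg in segments if seg not in stop_words]

# Illustrative segmentation of "I want to purchase vehicle insurance
# but need a loan first due to insufficient funds" into 10 segments.
segments = ["I", "want", "purchase", "vehicle insurance", "but",
            "funds", "insufficient", "need", "first", "loan"]
stop_words = {"I", "want", "but", "need"}

print(extract_keywords(segments, stop_words))
# ['purchase', 'vehicle insurance', 'funds', 'insufficient', 'first', 'loan']
```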
After keyword extraction on the first text data yields the keywords, the word sense information of the 6 keywords is acquired; a word combination is obtained from the 6 keywords, for example "vehicle insurance funds insufficient loan first", and the combined word sense information of that combination is acquired; the association relation between the keywords is determined from the word sense information and the combined word sense information, giving the meaning of the first text data as "take out a loan first, then purchase vehicle insurance"; and the semantic information corresponding to the first text data is determined from the keywords and their association relation as the first semantic information, here "loan, then purchase vehicle insurance".
Optionally, another method for acquiring the first semantic information from the first voice data may be: recognizing the first voice data to obtain at least two keywords directly from it. For example, if the first voice data is "I want to purchase vehicle insurance but need a loan first due to insufficient funds", the recognized keywords include "purchase", "vehicle insurance", "funds", "insufficient", "first", and "loan", and the first semantic information is "loan, then purchase vehicle insurance". Specifically, whether to convert the first voice data into text data for keyword extraction or to perform speech recognition on it directly can be decided according to specific requirements. For example, if the cost of speech recognition is low, speech recognition is adopted to save cost; or, if converting the voice data into text data yields higher keyword-extraction accuracy, that conversion is adopted to improve recognition accuracy.
Optionally, the method for determining the second voice data matched with the first semantic information according to the first semantic information may include the following steps:
Firstly, the similarities between the at least two keywords and a plurality of text data in a corpus are obtained, yielding a plurality of first similarities.
Here, the corpus is a database corresponding to the service processing server. The corpus may include text data related to service handling, such as specific process information for handling a service, and may also include text data unrelated to service handling. In a specific implementation, a similarity calculation method, such as the Pearson correlation coefficient or cosine similarity, may be used to calculate the similarity between each keyword and each text data in the corpus to obtain the plurality of first similarities; the method is not limited here.
Secondly, text data corresponding to the association relation is determined according to the association relation between the at least two keywords, and the similarities between that text data and the plurality of text data are obtained, yielding a plurality of second similarities.
Here, the association relation between the at least two keywords refers to how the keywords relate to one another; for example, it may represent the order of the services the keywords correspond to, and text data corresponding to the association relation can be determined from it. For example, the keywords include "purchase", "vehicle insurance", "funds", "insufficient", "first", and "loan". Without considering the association between the keywords, the text data obtained is "vehicle insurance loan purchase"; considering the association, the text data corresponding to the association relation is "loan, then purchase vehicle insurance", that is, the loan service is handled first and the vehicle insurance purchase service afterwards. In a specific implementation, a similarity calculation method, such as the Pearson correlation coefficient or cosine similarity, may be used to calculate the similarities between the text data corresponding to the association relation and the plurality of text data in the corpus to obtain the plurality of second similarities; the method is not limited here.
And thirdly, determining the text data corresponding to the maximum similarity in the plurality of first similarities as first target text data, and determining the text data corresponding to the maximum similarity in the plurality of second similarities as second target text data.
Here, since there are a plurality of text data in the corpus, a similarity is calculated between each of the at least two keywords and each text data in the corpus, so a plurality of first similarities is obtained. For example, if the number of keywords is n1 and the number of text data in the corpus is m1, then n1 × m1 first similarities can be calculated; if the number of text data corresponding to the association relation is n2, then n2 × m1 second similarities can be calculated. The n1 × m1 first similarities are compared to find the largest among them, and the n2 × m1 second similarities are compared likewise, so that the text data corresponding to the maximum first similarity is determined as the first target text data and the text data corresponding to the maximum second similarity as the second target text data.
And fourthly, determining second text data according to the first target text data and the second target text data.
Here, if the first target text data is the same as the second target text data, the first target text data may be determined as the second text data; if the first target text data is not the same as the second target text data, the second target text data may be determined as the second text data.
And fifthly, converting the second text data to obtain voice data corresponding to the second text data as second voice data.
In a specific implementation, Natural Language Processing (NLP) techniques may be used to process the text data, obtain the first semantic information, and determine the second voice data matching it; TTS or another technology may be employed to convert text data into voice data. By converting the second text data into the second voice data, the user can acquire the second voice data through the client and reply to it with the first multimedia data, and the service processing server then processes the corresponding service according to that reply. Converting the second text data into voice data lets the user grasp its content more intuitively; compared with having the user read the second text data directly, this improves the user's efficiency and therefore the service processing efficiency.
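The similarity matching in the first three steps above can be sketched with a bag-of-words cosine similarity; the corpus entries and query strings below are hypothetical, and a production system might use the Pearson correlation coefficient or learned sentence embeddings instead:

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts under a bag-of-words model."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)  # Counter returns 0 for missing terms
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(queries, corpus):
    """Compute len(queries) x len(corpus) similarities and return the
    corpus text carrying the single largest one (the 'target text data')."""
    best, best_sim = None, -1.0
    for q in queries:
        for text in corpus:
            sim = cosine(q, text)
            if sim > best_sim:
                best, best_sim = text, sim
    return best, best_sim
```

Running `best_match` once over the keywords gives the first target text data, and once over the text derived from the association relation gives the second.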
And S103, determining second video data matched with the second voice data from the video database.
Here, the video database includes a plurality of video data of various types: for example, silent video data of the video robot simulating a human speaking (mouth opening, mouth closing, eye movement, facial smiling), silent video data of the video robot simulating depression, disappointment, and the like, or silent video data of the video robot simulating apologetic expressions, and so on. In a specific implementation, the multiple types of video data can be stored in the video database in advance for subsequent use.
Optionally, the method for determining the second video data matching the second voice data from the video database may be: acquiring a semantic scene of the second text data and an application scene of each video data in the video database; determining a target application scene matched with the semantic scene from the application scenes; and determining the video data corresponding to the target application scene as second video data.
The second text data is the text data corresponding to the second voice data; the two have the same meaning but different forms of expression, one in characters and the other in sound. The semantic scene of the second text data refers to its meaning and may include, for example, a specific service handling process, indication information asking the user to wait, or a failure prompt (a handling failure may include, for example, a network connection failure or the server being busy). The application scene of each video data may be determined according to the type of that video data. For example, if the semantic scene is a specific service handling process, the target application scene matching it may be the silent video data of the video robot simulating a human speaking, with mouth opening, mouth closing, eye movement, facial smiling, and the like. If the semantic scene is indication information asking the user to wait, the matching target application scene may be the silent video data of the video robot simulating apologetic expressions. If the semantic scene is a failure prompt, the matching target application scene may be the silent video data of the video robot simulating depression or disappointment. By determining the target application scene matching the semantic scene and taking the video data corresponding to that application scene as the second video data, the synthesized data subsequently seen by the user is more natural, which increases the interest of the human-computer interaction.
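The scene matching in S103 can be sketched as a two-step lookup from semantic scene to target application scene to stored video data; all scene labels and file names below are hypothetical, and the real system would derive the semantic scene from the second text data via NLP:

```python
# Hypothetical application scenes mapped to pre-stored silent video data.
VIDEO_DATABASE = {
    "talking": "robot_talking.mp4",       # mouth/eye movement, smiling
    "apologetic": "robot_apology.mp4",    # apologetic expressions (waiting)
    "disappointed": "robot_failure.mp4",  # depression, disappointment
}

# Hypothetical semantic scenes mapped to the matching application scene.
SCENE_TO_APPLICATION = {
    "business_process": "talking",
    "please_wait": "apologetic",
    "failure_prompt": "disappointed",
}

def match_video(semantic_scene: str) -> str:
    """Determine the target application scene matching the semantic
    scene, then return the video data stored for that scene."""
    app_scene = SCENE_TO_APPLICATION[semantic_scene]
    return VIDEO_DATABASE[app_scene]
```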
And S104, synthesizing the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to the first multimedia data replied by the client.
Wherein the first multimedia data is reply data of the client to the synthesized data. The service processing server sends the synthesized data to the client, and after the user learns the synthesized data through the client, the user can perform corresponding reply according to the synthesized data, for example, answer a question in the synthesized data, fill in the identity information of the user according to the prompt information in the synthesized data, upload a corresponding identity document, and the like. The client acquires data replied by the user according to the synthetic data, for example, recording voice replied by the user, and recording the action and expression of the user to obtain first multimedia data. The client sends the first multimedia data to the service processing server, and the service processing server performs service processing according to the first multimedia data.
Optionally, the method for synthesizing the second video data and the second voice data to obtain synthesized data may be: acquiring the voice duration of the second voice data; intercepting video data with the same time length as the voice time length from the second video data to serve as candidate video data; and synthesizing the second voice data and the candidate video data to obtain synthesized data.
Here, if the video duration of the second video data equals the voice duration of the second voice data, the second video data is determined as the candidate video data. If the video duration is greater than the voice duration, video data with the same duration as the voice may be intercepted from the second video data as the candidate video data; for example, if the video lasts 5 seconds and the voice 3 seconds, 3 seconds of video may be taken from the 5-second second video data as the candidate video data. If the video duration is less than the voice duration, copies of the second video data may be concatenated to obtain video data with the same duration as the voice as the candidate video data; for example, if the video lasts 3 seconds and the voice 6 seconds, the same 3-second video data may be concatenated to obtain 6 seconds of video as the candidate video data. The second voice data and the candidate video data may then be synthesized into video data with sound as the synthesized data, the sound of which is the second voice data; or the second voice data and the candidate video data are sent to the client simultaneously and played simultaneously by the client, so that the synthesized data comprises both.
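The duration-matching rule above reduces to simple arithmetic: how many full loops of the silent clip, plus how long a trimmed tail, add up to the voice duration. A minimal sketch (the actual audio/video muxing is not shown):

```python
def fit_video_to_voice(video_sec: float, voice_sec: float):
    """Return (loops, trim_sec): play the silent clip `loops` full
    times, then the first `trim_sec` seconds of it once more, so the
    total video length equals the voice length."""
    if video_sec <= 0:
        raise ValueError("video duration must be positive")
    loops, trim = divmod(voice_sec, video_sec)
    return int(loops), trim
```

For the examples in the text: a 5-second clip against 3 seconds of voice gives zero full loops and a 3-second trim, and a 3-second clip against 6 seconds of voice gives two full loops with nothing trimmed.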
Therefore, when the service processing server sends the voice data to the client, the video data matched with the voice data is determined from the video database, so that synthesized data is obtained according to the voice data and the video data and then sent to the client, a conversation with the client is formed, and man-machine interaction is more natural.
In the embodiment of the application, a service handling request sent by a client is received, the request comprising first voice data and first video data; first semantic information is acquired from the first voice data, and second voice data matching it is determined accordingly. Because the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relation between those keywords, it represents the meaning of the first voice data more accurately; that is, the speech recognition accuracy is higher, so the matched second voice data is more accurate and the accuracy of the video robot call can be improved. Second video data matching the second voice data is then determined from the video database, the two are synthesized into synthesized data that is sent to the client, and service processing is performed according to the first multimedia data replied by the client. Processing and replying to the user's voice data in real time improves the efficiency of service processing, and synthesizing the second voice data with the second video data before sending lets the user see the video robot holding a video conversation with them through the client, making the human-computer interaction more natural, increasing its interest, and improving the user experience.
Optionally, please refer to fig. 3, where fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 3, the method includes:
S201, receiving a service handling request sent by a client.
S202, first semantic information is obtained from the first voice data, and second voice data matched with the first semantic information is determined according to the first semantic information.
And S203, determining second video data matched with the second voice data from the video database.
And S204, synthesizing the second video data and the second voice data to obtain synthesized data, and sending the synthesized data to the client.
Here, the specific implementation manner of steps S201 to S204 may refer to the description of steps S101 to S104 in the embodiment corresponding to fig. 2, and is not described herein again.
S205, a first image of the user corresponding to the client is intercepted from the first video data.
Here, the first video data is obtained by the client recording video of data such as the user's posture, expressions, and actions, so the first video data includes the user's face image. The service processing server can intercept images from the first video data at preset intervals to obtain first images containing the user's face, that is, the first images of the user corresponding to the client. For example, an image may be cut from the first video data every 0.5 seconds; if the duration of the first video data is 2 seconds, 4 first images are acquired.
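The periodic interception above can be sketched as computing the capture timestamps; the 0.5-second interval follows the example in the text, while the actual frame decoding (for instance with a video library) is not shown:

```python
def capture_timestamps(duration_sec: float, interval_sec: float = 0.5):
    """Timestamps (in seconds) at which first images are cut from the
    first video data: one frame per interval across the clip."""
    n = int(duration_sec / interval_sec)
    return [round((i + 1) * interval_sec, 3) for i in range(n)]
```

A 2-second clip sampled every 0.5 seconds yields four timestamps, matching the four first images in the example.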
And S206, verifying the legality of the client according to the first image.
Here, verifying the validity of the client according to the first image may refer to verifying whether the user's face image in the first image is the same as the user's face image stored by the service processing server. If so, the client is determined to be legitimate. If not, the client is determined not to be legitimate, and adjustment information instructing the user to adjust their posture is sent to the client; second multimedia data sent by the client for the adjustment information is acquired; a second image of the user is intercepted from the third video data; and the validity of the client is verified according to the second image.
The second multimedia data comprises third video data. The user's face information stored by the service processing server may come from the face information retained when the user handled historical services at the service processing server; for example, if the user opened a bank card at the service processing server, the stored face information may be that reserved at the time. If the user has not handled historical services at the service processing server, or no face information was stored when they did, the user's face information may be acquired from other servers where it is stored, for example servers corresponding to institutions such as the public security department or the civil affairs department. When the service processing server verifies that the user's face image in the first image is not the same as the stored face image, it determines that the client is not legitimate and sends the client adjustment information instructing the user to adjust their posture, so that the user adjusts accordingly. For example, if the user's face is not aligned with the client's camera, the user aligns it after the adjustment; or, if the client's camera captures both user A and user B while user A is the one handling the service, only user A remains in frame after the adjustment.
Specifically, the service processing server acquires the second multimedia data sent by the client for the adjustment information, the second multimedia data comprising the third video data; intercepts a second image of the user from the third video data; and verifies the validity of the client according to the second image. The second image comprises the user's face image. If the second image and the face image stored by the service processing server belong to the same user, the client is legitimate, and service processing is performed according to the first multimedia data replied by the client. If they do not, the client is not legitimate, the service processing ends, and information is output instructing the user to handle the service at the manual service location corresponding to the service processing server.
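One way to realize the face comparison above is a cosine-similarity check over face embeddings; the embedding extraction itself (a face-recognition model applied to the intercepted image) is not shown, and the 0.8 threshold is an assumption for illustration:

```python
from math import sqrt

def is_same_user(stored_embedding, captured_embedding,
                 threshold: float = 0.8) -> bool:
    """Compare the stored face embedding with one extracted from the
    intercepted first/second image; treat the client as legitimate when
    the cosine similarity clears the threshold."""
    dot = sum(a * b for a, b in zip(stored_embedding, captured_embedding))
    na = sqrt(sum(a * a for a in stored_embedding))
    nb = sqrt(sum(b * b for b in captured_embedding))
    if na == 0 or nb == 0:
        return False  # degenerate embedding: refuse rather than accept
    return dot / (na * nb) >= threshold
```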
And S207, if the client side has the legality, performing service processing according to the first multimedia data replied by the client side.
Here, if the client is legal, that is, the second image is the face image of the same user as the face image of the user stored in the service processing server, the service processing is performed according to the first multimedia data returned by the client.
In a possible implementation, before performing the service processing, the service processing server may further verify the validity of the client a second time according to the first multimedia data. For example, third images of the user answering questions may be obtained from the first multimedia data: if the synthesized data contains 3 questions, the images of the user answering those 3 questions are intercepted from the first multimedia data, giving third images containing the user's face. Micro-expression recognition is performed on the third images, so that the authenticity of the user's answers can be judged from the micro-expressions shown while answering. If micro-expression recognition indicates the user's answers have high authenticity, the service is processed. If it indicates low authenticity, indication information for secondary verification of the user's identity is sent, or the questions whose answers showed abnormal micro-expressions are output again. If the secondary verification passes, or the expressions when the user answers again indicate high authenticity, the service is processed; if the secondary verification fails, or those expressions indicate low authenticity, information is output instructing the user to handle the service at the manual service location corresponding to the service processing server, and the service processing ends. Because micro-expression recognition can verify the authenticity of what the user says, the accuracy of data recognition can be improved.
By acquiring the first image from the first video data and sending it to the service processing server for verification, the authenticity of the user's identity can be improved; and by performing micro-expression recognition on the third images in the first multimedia data, the service processing server can judge the authenticity of the user's answers, verifying the user's identity information a second time and improving the accuracy of service processing.
In the embodiment of the application, before the service processing, the validity of the client, that is, the user's identity information, is verified. If the client is verified as legitimate, the user's identity information is genuine and the corresponding service processing is performed; if not, the client is prompted through the output adjustment information to adjust the user's posture, and the validity is verified again. This improves the reliability of the identity verification and therefore the accuracy of service processing.
The method of the embodiments of the present application is described above, and the apparatus of the embodiments of the present application is described below.
Referring to fig. 4, fig. 4 is a schematic diagram of a component structure of a data processing apparatus provided in an embodiment of the present application, where the data processing apparatus may be a computer program (including program code) running in a computer device, for example, the data processing apparatus is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. The apparatus 40 comprises:
a data obtaining module 401, configured to receive a service handling request sent by a client, where the service handling request includes first voice data and first video data;
a voice matching module 402, configured to obtain first semantic information from the first voice data, and determine, according to the first semantic information, second voice data matched with the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relation between the at least two keywords;
a video matching module 403, configured to determine second video data matching the second voice data from a video database, where the video database includes a plurality of video data;
a service processing module 404, configured to synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to first multimedia data replied by the client, where the first multimedia data is reply data of the client for the synthesized data.
Optionally, the voice matching module 402 is specifically configured to:
converting the first voice data to obtain first text data corresponding to the first voice data;
extracting keywords from the first text data to obtain at least two keywords;
acquiring word sense information of the at least two keywords;
obtaining a word combination according to the at least two keywords, and acquiring combined word sense information of the word combination;
determining the association relationship between the at least two keywords according to the word sense information and the combined word sense information;
and determining, as the first semantic information, semantic information corresponding to the first text data according to the at least two keywords and the association relationship between the at least two keywords.
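The steps above can be sketched in Python for illustration. The lexicon, the combined-sense table, and the relation format below are all hypothetical stand-ins; the embodiment does not specify how word senses or the association relationship are actually obtained:

```python
# Toy lexicon and combination table standing in for an ASR/lexical backend.
WORD_SENSES = {                 # assumed: keyword -> word-sense label
    "transfer": "action",
    "account": "object",
}
COMBINED_SENSES = {             # assumed: keyword combination -> combined sense
    ("transfer", "account"): "funds-operation",
}

def extract_keywords(text):
    """Keep only lexicon words (stand-in for real keyword extraction)."""
    return [w for w in text.lower().split() if w in WORD_SENSES]

def first_semantic_info(text):
    """Return the keywords plus an association relationship between them."""
    keywords = extract_keywords(text)
    senses = {k: WORD_SENSES[k] for k in keywords}
    combo = tuple(keywords[:2])
    combined = COMBINED_SENSES.get(combo)
    # The relationship is derived from the individual and combined senses.
    relation = f"{senses[combo[0]]}->{senses[combo[1]]}" if combined else None
    return {"keywords": keywords, "relation": relation,
            "combined_sense": combined}

info = first_semantic_info("please transfer money to my account")
```

Here the pair ("transfer", "account") yields the combined sense "funds-operation" and the relation "action->object", which together play the role of the first semantic information.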
Optionally, the voice matching module 402 is specifically configured to:
acquiring similarities between the at least two keywords and a plurality of text data in a corpus to obtain a plurality of first similarities;
determining text data corresponding to the association relationship according to the association relationship between the at least two keywords, and obtaining the similarity between the text data corresponding to the association relationship and the plurality of text data to obtain a plurality of second similarities;
determining text data corresponding to the maximum similarity among the plurality of first similarities as first target text data, and determining text data corresponding to the maximum similarity among the plurality of second similarities as second target text data;
determining the second text data according to the first target text data and the second target text data;
and converting the second text data to obtain voice data corresponding to the second text data as the second voice data.
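For illustration only, the two-pass similarity matching above can be sketched as follows. Jaccard token overlap stands in for the unspecified similarity measure, and the corpus entries and reply texts are hypothetical:

```python
# Assumed toy corpus: candidate text -> associated reply text.
CORPUS = {
    "transfer funds": "Please confirm the transfer amount.",
    "open account": "Please provide your ID card.",
}

def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def best_match(query, corpus):
    """Return the corpus text with the highest similarity to `query`."""
    return max(corpus, key=lambda text: jaccard(query, text))

def second_text(keywords, relation_text):
    # First similarities: the keywords against every corpus text.
    first_target = best_match(" ".join(keywords), CORPUS)
    # Second similarities: the relation-derived text against the corpus.
    second_target = best_match(relation_text, CORPUS)
    # Combine both target texts into the reply (a simple policy: prefer
    # agreement, otherwise join the two replies).
    if first_target == second_target:
        return CORPUS[first_target]
    return CORPUS[first_target] + " " + CORPUS[second_target]

reply = second_text(["transfer", "funds"], "transfer funds to account")
# reply -> "Please confirm the transfer amount."
```

In a full system the resulting second text data would then be passed to a text-to-speech step to produce the second voice data.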
Optionally, the video matching module 403 is specifically configured to:
obtaining a semantic scene of the second text data and an application scene of each video data in the video database;
determining a target application scene matched with the semantic scene from the application scenes;
and determining the video data corresponding to the target application scene as the second video data.
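A minimal sketch of the scene-based video selection above, with assumed scene labels and a toy one-word scene classifier (the embodiment leaves the scene-classification method unspecified):

```python
# Assumed video database entries: (video id, application scene label).
VIDEO_DATABASE = [
    ("greeting.mp4", "greeting"),
    ("confirm.mp4", "transaction-confirmation"),
]

def semantic_scene(text):
    """Toy scene classifier keyed on a single cue word."""
    return "transaction-confirmation" if "confirm" in text else "greeting"

def match_video(second_text_data):
    """Pick the video whose application scene matches the semantic scene."""
    scene = semantic_scene(second_text_data)
    for video_id, app_scene in VIDEO_DATABASE:
        if app_scene == scene:            # target application scene matched
            return video_id
    return VIDEO_DATABASE[0][0]           # fall back to a default clip

vid = match_video("Please confirm the transfer amount.")
# vid -> "confirm.mp4"
```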
Optionally, the service processing module 404 is specifically configured to:
acquiring the voice duration of the second voice data;
intercepting, from the second video data, video data whose duration is the same as the voice duration to serve as candidate video data;
and synthesizing the second voice data and the candidate video data to obtain the synthesized data.
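The trim-and-synthesize step above can be sketched with plain lists standing in for media streams; the frame rate is assumed, and a real system would cut and mux with an audio/video toolchain rather than list slicing:

```python
FPS = 25                                   # assumed frame rate of the clips

def synthesize(voice_duration_sec, video_frames):
    """Cut the video to the voice duration, then pair audio with video."""
    needed_frames = int(voice_duration_sec * FPS)
    candidate = video_frames[:needed_frames]   # same duration as the voice
    return {"audio_sec": voice_duration_sec, "video": candidate}

clip = synthesize(2.0, list(range(100)))   # trim a 4-second clip to 2 s
# clip["video"] holds 50 frames (2.0 s at 25 fps)
```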
Optionally, the apparatus 40 further comprises: a validity verification module 405 for:
intercepting a first image of a user corresponding to the client from the first video data;
verifying the validity of the client according to the first image;
if the client is valid, executing the step of performing service processing according to the first multimedia data replied by the client.
Optionally, the apparatus 40 further comprises: an information adjustment module 406, configured to:
if the client is not valid, sending, to the client, adjustment information instructing the user to adjust his or her posture;
acquiring second multimedia data sent by the client aiming at the adjustment information, wherein the second multimedia data comprises third video data;
intercepting a second image of the user from the third video data;
and verifying the validity of the client according to the second image.
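The verify-then-retry flow of the two modules above can be sketched as follows; `verify_image` is a hypothetical stand-in for the unspecified image-based validity check:

```python
def verify_image(image):
    """Toy check: an image passes when the face is marked visible."""
    return image.get("face_visible", False)

def verify_client(first_frame, request_adjusted_frame):
    """Verify the first frame; on failure, ask the client for a
    posture-adjusted retake (the third video data) and verify again."""
    if verify_image(first_frame):
        return True
    second_frame = request_adjusted_frame()   # adjustment round trip
    return verify_image(second_frame)

ok = verify_client({"face_visible": False},
                   lambda: {"face_visible": True})
# ok -> True: the first frame fails, the adjusted retake passes
```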
It should be noted that, for the content that is not mentioned in the embodiment corresponding to fig. 4, reference may be made to the description of the method embodiment, and details are not described here again.
In the embodiment of the application, a service handling request sent by a client is received, where the service handling request includes first voice data and first video data; first semantic information is acquired from the first voice data, and second voice data matched with the first semantic information is determined according to the first semantic information. Because the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relationship between the at least two keywords, the obtained first semantic information can represent the meaning of the first voice data more accurately; that is, the voice recognition accuracy is higher, so the second voice data obtained by matching is more accurate, which improves the accuracy of the video robot's conversation. Second video data matched with the second voice data is then determined from the video database, the second video data and the second voice data are synthesized to obtain synthesized data, the synthesized data is sent to the client, and service processing is performed according to the first multimedia data replied by the client. Processing and replying to the voice data sent by the user in real time improves the efficiency of service processing. Because the second voice data and the second video data are synthesized before being sent to the client, the user sees the video robot carrying on a video conversation with him or her through the client, which makes human-computer interaction more natural, increases its interest, and improves the user experience.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 5, the computer device 50 may include: a processor 501, a network interface 504 and a memory 505; the computer device 50 may further include: a user interface 503 and at least one communication bus 502, where the communication bus 502 is used to enable connective communication between these components. The user interface 503 may include a display screen (Display) and a keyboard (Keyboard), and optionally may also include a standard wired interface and a standard wireless interface. The network interface 504 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 505 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory); it may alternatively be at least one storage device located remotely from the processor 501. As shown in fig. 5, the memory 505, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 50 shown in fig. 5, the network interface 504 may provide network communication functions, the user interface 503 mainly provides an input interface for the user, and the processor 501 may be used to invoke the device control application stored in the memory 505 to implement:
receiving a service handling request sent by a client, wherein the service handling request comprises first voice data and first video data;
acquiring first semantic information from the first voice data, and determining, according to the first semantic information, second voice data matched with the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relationship between the at least two keywords;
determining second video data matched with the second voice data from a video database, wherein the video database comprises a plurality of video data;
and synthesizing the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data replied by the client, where the first multimedia data is reply data of the client for the synthesized data.
It should be understood that the computer device 50 described in this embodiment may perform the description of the data processing method in the embodiment corresponding to fig. 2 and fig. 3, and may also perform the description of the data processing apparatus in the embodiment corresponding to fig. 4, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
In the embodiment of the application, a service handling request sent by a client is received, where the service handling request includes first voice data and first video data; first semantic information is acquired from the first voice data, and second voice data matched with the first semantic information is determined according to the first semantic information. Because the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relationship between the at least two keywords, the obtained first semantic information can represent the meaning of the first voice data more accurately; that is, the voice recognition accuracy is higher, so the second voice data obtained by matching is more accurate, which improves the accuracy of the video robot's conversation. Second video data matched with the second voice data is then determined from the video database, the second video data and the second voice data are synthesized to obtain synthesized data, the synthesized data is sent to the client, and service processing is performed according to the first multimedia data replied by the client. Processing and replying to the voice data sent by the user in real time improves the efficiency of service processing. Because the second voice data and the second video data are synthesized before being sent to the client, the user sees the video robot carrying on a video conversation with him or her through the client, which makes human-computer interaction more natural, increases its interest, and improves the user experience.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program. The computer program comprises program instructions which, when executed by a computer, cause the computer to perform the method according to the foregoing embodiments; the computer may be a part of the above-mentioned computer device, such as the processor 501 described above. By way of example, the program instructions may be executed on one computer device, or on multiple computer devices located at one site, or distributed across multiple sites and interconnected by a communication network, which may comprise a blockchain network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit its scope; equivalent variations and modifications made in accordance with the claims of the present application still fall within the scope of the present application.

Claims (10)

1. A data processing method, comprising:
receiving a service handling request sent by a client, wherein the service handling request comprises first voice data and first video data;
acquiring first semantic information from the first voice data, and determining, according to the first semantic information, second voice data matched with the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in first text data corresponding to the first voice data and the association relationship between the at least two keywords;
determining second video data matched with the second voice data from a video database, wherein the video database comprises a plurality of video data;
and synthesizing the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data replied by the client, where the first multimedia data is reply data of the client for the synthesized data.
2. The method of claim 1, wherein the obtaining the first semantic information from the first voice data comprises:
converting the first voice data to obtain first text data corresponding to the first voice data;
extracting keywords from the first text data to obtain at least two keywords;
acquiring word sense information of the at least two keywords;
obtaining a word combination according to the at least two keywords, and acquiring combined word sense information of the word combination;
determining the association relationship between the at least two keywords according to the word sense information and the combined word sense information;
and determining, as the first semantic information, semantic information corresponding to the first text data according to the at least two keywords and the association relationship between the at least two keywords.
3. The method of claim 2, wherein determining second speech data that matches the first semantic information based on the first semantic information comprises:
acquiring similarities between the at least two keywords and a plurality of text data in a corpus to obtain a plurality of first similarities;
determining text data corresponding to the association relationship according to the association relationship between the at least two keywords, and obtaining the similarity between the text data corresponding to the association relationship and the plurality of text data to obtain a plurality of second similarities;
determining text data corresponding to the maximum similarity in the first similarities as first target text data, and determining text data corresponding to the maximum similarity in the second similarities as second target text data;
determining the second text data according to the first target text data and the second target text data;
and converting the second text data to obtain voice data corresponding to the second text data as the second voice data.
4. The method of claim 3, wherein determining second video data from a video database that matches the second voice data comprises:
obtaining a semantic scene of the second text data and an application scene of each video data in the video database;
determining a target application scene matched with the semantic scene from the application scenes;
and determining the video data corresponding to the target application scene as the second video data.
5. The method according to claim 1, wherein the synthesizing the second video data and the second voice data to obtain synthesized data comprises:
acquiring the voice duration of the second voice data;
intercepting, from the second video data, video data whose duration is the same as the voice duration to serve as candidate video data;
and synthesizing the second voice data and the candidate video data to obtain the synthesized data.
6. The method of claim 1, further comprising:
intercepting a first image of a user corresponding to the client from the first video data;
verifying the validity of the client according to the first image;
and if the client is valid, executing the step of performing service processing according to the first multimedia data replied by the client.
7. The method of claim 6, further comprising:
if the client is not valid, sending, to the client, adjustment information instructing the user to adjust his or her posture;
acquiring second multimedia data sent by the client aiming at the adjustment information, wherein the second multimedia data comprises third video data;
intercepting a second image of the user from the third video data;
and verifying the validity of the client according to the second image.
8. A data processing apparatus, comprising:
the data acquisition module is used for receiving a service handling request sent by a client, wherein the service handling request comprises first voice data and first video data;
the voice matching module is used for acquiring first semantic information from the first voice data and determining, according to the first semantic information, second voice data matched with the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in first text data corresponding to the first voice data and the association relationship between the at least two keywords;
the video matching module is used for determining second video data matched with the second voice data from a video database, and the video database comprises a plurality of video data;
and the service processing module is used for synthesizing the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data replied by the client, where the first multimedia data is reply data of the client for the synthesized data.
9. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide data communication functions, the memory is configured to store program code, and the processor is configured to call the program code to perform the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
CN202011006333.7A 2020-09-22 2020-09-22 Data processing method, device, equipment and medium Pending CN112131365A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011006333.7A CN112131365A (en) 2020-09-22 2020-09-22 Data processing method, device, equipment and medium
PCT/CN2020/124256 WO2021159734A1 (en) 2020-09-22 2020-10-28 Data processing method and apparatus, device, and medium


Publications (1)

Publication Number Publication Date
CN112131365A true CN112131365A (en) 2020-12-25

Family

ID=73842593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011006333.7A Pending CN112131365A (en) 2020-09-22 2020-09-22 Data processing method, device, equipment and medium

Country Status (2)

Country Link
CN (1) CN112131365A (en)
WO (1) WO2021159734A1 (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008216461A (en) * 2007-03-01 2008-09-18 Nec Corp Speech recognition, keyword extraction, and knowledge base retrieval coordinating device
CN108090170B (en) * 2017-12-14 2019-03-26 南京美桥信息科技有限公司 A kind of intelligence inquiry method for recognizing semantics and visible intelligent interrogation system
CN109241332B (en) * 2018-10-19 2021-09-24 广东小天才科技有限公司 Method and system for determining semantics through voice
KR102181583B1 (en) * 2018-12-28 2020-11-20 수상에스티(주) System for voice recognition of interactive robot and the method therof
CN110405791B (en) * 2019-08-16 2020-03-31 江苏遨信科技有限公司 Method and system for simulating and learning speech by robot

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918913A (en) * 2017-11-20 2018-04-17 中国银行股份有限公司 Banking processing method, device and system
CN110162780A (en) * 2019-04-08 2019-08-23 深圳市金微蓝技术有限公司 The recognition methods and device that user is intended to
CN110489527A (en) * 2019-08-13 2019-11-22 南京邮电大学 Banking intelligent consulting based on interactive voice and handle method and system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114760425A (en) * 2022-03-21 2022-07-15 京东科技信息技术有限公司 Digital human generation method, device, computer equipment and storage medium
CN115022395A (en) * 2022-05-27 2022-09-06 平安普惠企业管理有限公司 Business video pushing method and device, electronic equipment and storage medium
CN115022395B (en) * 2022-05-27 2023-08-08 艾普科创(北京)控股有限公司 Service video pushing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021159734A1 (en) 2021-08-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination