CN112151027A - Specific person inquiry method, device and storage medium based on digital person

Info

Publication number
CN112151027A
CN112151027A (application CN202010847705.2A)
Authority
CN
China
Prior art keywords
target
specific person
voice
inquiry
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010847705.2A
Other languages
Chinese (zh)
Other versions
CN112151027B (en)
Inventor
常向月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202010847705.2A
Publication of CN112151027A
Application granted
Publication of CN112151027B
Active legal status
Anticipated expiration of legal status

Classifications

    • G10L 15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06V 40/166: Human faces; detection, localisation, normalisation using acquisition arrangements
    • G06V 40/168: Human faces; feature extraction, face representation
    • G06V 40/174: Human faces; facial expression recognition
    • G10L 15/1822: Speech classification or search using natural language modelling; parsing for meaning understanding
    • G10L 15/30: Distributed speech recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 25/51: Speech or voice analysis specially adapted for comparison or discrimination
    • G10L 25/90: Pitch determination of speech signals
    • G10L 2025/906: Pitch tracking

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present application relates to a digital person-based specific person inquiry method, apparatus and storage medium. The method comprises the following steps: acquiring the specific person voice and specific person image produced when a target specific person replies to a first query sentence; obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set; combining features in the behavior feature set with features in the speech feature set to obtain a first combined feature; and determining a target response intention corresponding to the target specific person based on the first combined feature. The digital person may be represented by an avatar, and the avatar can be adjusted according to the target response intention corresponding to the target specific person; for example, when the response intention is lying, the avatar's expression is adjusted to be stern. With this method, the accuracy of judging the authenticity of a specific person's answers to query sentences can be improved.

Description

Specific person inquiry method, device and storage medium based on digital person
Technical Field
The present application relates to the field of human-computer interaction technology, and in particular, to a method, an apparatus, and a storage medium for querying a specific person based on a digital person.
Background
With the development of science and technology, human-computer interaction technology can be used to complete specific work in many scenarios; for example, a digital person may provide services such as problem solving and information lookup for users.
At present, after receiving a sentence input by a user, a digital person can understand the sentence and determine the meaning it contains. However, the meaning a sentence expresses often does not match the actual situation; that is, the authenticity of the meaning expressed by the sentence cannot be determined, and the accuracy of authenticity judgment is low.
Disclosure of Invention
In view of the above, it is necessary to provide a digital person-based specific person query method, apparatus and storage medium.
A digital person-based specific person query method, the method comprising: outputting a first query sentence; acquiring the specific person voice and specific person image produced when a target specific person replies to the first query sentence; obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set; combining features in the behavior feature set with features in the speech feature set to obtain a first combined feature; and determining a target response intention corresponding to the target specific person based on the first combined feature.
In some embodiments, the obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set includes: acquiring voice attribute information corresponding to the target specific person based on the specific person voice, wherein the voice attribute information includes at least one of speech rate information or intonation change information.
In some embodiments, the obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set includes: performing semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice. The method further comprises: combining the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature. The determining a target response intention corresponding to the target specific person based on the first combined feature comprises: determining the target response intention corresponding to the target specific person based on the first combined feature and the second combined feature.
In some embodiments, the obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set includes: acquiring facial features corresponding to the target specific person based on the specific person image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target specific person.
In some embodiments, the method further comprises: determining a target query policy based on the target response intention corresponding to the target specific person, and querying the target specific person based on the target query policy.
In some embodiments, the determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy includes: acquiring a second query sentence for querying the target specific person; determining a corresponding target query intonation according to the target response intention; obtaining a target query voice according to the second query sentence and the target query intonation; and outputting the target query voice.
In some embodiments, the obtaining a second query sentence for querying the target specific person includes: performing semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquiring a query sentence corresponding to the target semantics from the query sentence library as the second query sentence.
In some embodiments, the obtaining a target query voice according to the second query sentence and the target query intonation includes: acquiring background attribute information corresponding to the target specific person; modifying the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtaining the target query voice according to the modified second query sentence and the target query intonation.
In some embodiments, the determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy includes: acquiring an avatar corresponding to the digital person; determining a corresponding target avatar adjustment parameter based on the target response intention corresponding to the target specific person; and adjusting the avatar according to the target avatar adjustment parameter, and controlling the adjusted avatar to query the target specific person.
In some embodiments, the determining a target response intention corresponding to the target specific person based on the first combined feature comprises: inputting the first combined feature into a trained intention recognition model, and processing the first combined feature by the intention recognition model using model parameters corresponding to the first combined feature to obtain the target response intention corresponding to the target specific person.
A digital person-based specific person query apparatus, the apparatus comprising: a first query sentence output module, configured to output a first query sentence; an information acquisition module, configured to acquire the specific person voice and specific person image produced when a target specific person replies to the first query sentence; a behavior feature set obtaining module, configured to obtain a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; a speech feature set obtaining module, configured to obtain a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set; a first combination module, configured to combine features in the behavior feature set with features in the speech feature set to obtain a first combined feature; and a target response intention determination module, configured to determine the target response intention corresponding to the target specific person based on the first combined feature.
In some embodiments, the speech feature set obtaining module is configured to: acquire voice attribute information corresponding to the target specific person based on the specific person voice, wherein the voice attribute information includes at least one of speech rate information or intonation change information.
In some embodiments, the speech feature set obtaining module is configured to: perform semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice. The apparatus also includes a second combination module, configured to: combine the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature. The target response intention determination module is configured to: determine the target response intention corresponding to the target specific person based on the first combined feature and the second combined feature.
In some embodiments, the behavior feature set obtaining module is configured to: acquire facial features corresponding to the target specific person based on the specific person image, and process the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target specific person.
In some embodiments, the apparatus further comprises a target query policy determination module, configured to: determine a target query policy based on the target response intention corresponding to the target specific person, and query the target specific person based on the target query policy.
In some embodiments, the target query policy determination module comprises: a second query sentence acquisition unit, configured to acquire a second query sentence for querying the target specific person; a target query intonation determination unit, configured to determine a corresponding target query intonation according to the target response intention; a target query voice obtaining unit, configured to obtain a target query voice according to the second query sentence and the target query intonation; and a target query voice output unit, configured to output the target query voice.
In some embodiments, the second query sentence acquisition unit is configured to: perform semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquire a query sentence corresponding to the target semantics from the query sentence library as the second query sentence.
In some embodiments, the target query voice obtaining unit is configured to: acquire background attribute information corresponding to the target specific person; modify the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtain the target query voice according to the modified second query sentence and the target query intonation.
In some embodiments, the target query policy determination module is configured to: acquire an avatar corresponding to the digital person; determine a corresponding target avatar adjustment parameter based on the target response intention corresponding to the target specific person; and adjust the avatar according to the target avatar adjustment parameter, and control the adjusted avatar to query the target specific person.
In some embodiments, the target response intention determination module is configured to: input the first combined feature into a trained intention recognition model, and process the first combined feature by the intention recognition model using model parameters corresponding to the first combined feature to obtain the target response intention corresponding to the target specific person.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps: outputting a first query sentence; acquiring the specific person voice and specific person image produced when a target specific person replies to the first query sentence; obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set; combining features in the behavior feature set with features in the speech feature set to obtain a first combined feature; and determining a target response intention corresponding to the target specific person based on the first combined feature.
In some embodiments, the obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set includes: acquiring voice attribute information corresponding to the target specific person based on the specific person voice, wherein the voice attribute information includes at least one of speech rate information or intonation change information.
In some embodiments, the obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set includes: performing semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice. The computer program further causes the processor to perform the steps of: combining the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature. The determining a target response intention corresponding to the target specific person based on the first combined feature comprises: determining the target response intention corresponding to the target specific person based on the first combined feature and the second combined feature.
In some embodiments, the obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set includes: acquiring facial features corresponding to the target specific person based on the specific person image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target specific person.
In some embodiments, the computer program further causes the processor to perform the steps of: determining a target query policy based on the target response intention corresponding to the target specific person, and querying the target specific person based on the target query policy.
In some embodiments, the determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy includes: acquiring a second query sentence for querying the target specific person; determining a corresponding target query intonation according to the target response intention; obtaining a target query voice according to the second query sentence and the target query intonation; and outputting the target query voice.
In some embodiments, the obtaining a second query sentence for querying the target specific person includes: performing semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquiring a query sentence corresponding to the target semantics from the query sentence library as the second query sentence.
In some embodiments, the obtaining a target query voice according to the second query sentence and the target query intonation includes: acquiring background attribute information corresponding to the target specific person; modifying the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtaining the target query voice according to the modified second query sentence and the target query intonation.
In some embodiments, the determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy includes: acquiring an avatar corresponding to the digital person; determining a corresponding target avatar adjustment parameter based on the target response intention corresponding to the target specific person; and adjusting the avatar according to the target avatar adjustment parameter, and controlling the adjusted avatar to query the target specific person.
In some embodiments, the determining a target response intention corresponding to the target specific person based on the first combined feature comprises: inputting the first combined feature into a trained intention recognition model, and processing the first combined feature by the intention recognition model using model parameters corresponding to the first combined feature to obtain the target response intention corresponding to the target specific person.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the following steps: outputting a first query sentence; acquiring the specific person voice and specific person image produced when a target specific person replies to the first query sentence; obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set; combining features in the behavior feature set with features in the speech feature set to obtain a first combined feature; and determining a target response intention corresponding to the target specific person based on the first combined feature.
In some embodiments, the obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set includes: acquiring voice attribute information corresponding to the target specific person based on the specific person voice, wherein the voice attribute information includes at least one of speech rate information or intonation change information.
In some embodiments, the obtaining a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set includes: performing semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice. The computer program further causes the processor to perform the steps of: combining the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature. The determining a target response intention corresponding to the target specific person based on the first combined feature comprises: determining the target response intention corresponding to the target specific person based on the first combined feature and the second combined feature.
In some embodiments, the obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set includes: acquiring facial features corresponding to the target specific person based on the specific person image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target specific person.
In some embodiments, the computer program further causes the processor to perform the steps of: determining a target query policy based on the target response intention corresponding to the target specific person, and querying the target specific person based on the target query policy.
In some embodiments, the determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy includes: acquiring a second query sentence for querying the target specific person; determining a corresponding target query intonation according to the target response intention; obtaining a target query voice according to the second query sentence and the target query intonation; and outputting the target query voice.
In some embodiments, the obtaining a second query sentence for querying the target specific person includes: performing semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquiring a query sentence corresponding to the target semantics from the query sentence library as the second query sentence.
In some embodiments, the obtaining a target query voice according to the second query sentence and the target query intonation includes: acquiring background attribute information corresponding to the target specific person; modifying the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtaining the target query voice according to the modified second query sentence and the target query intonation.
In some embodiments, the determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy includes: acquiring an avatar corresponding to the digital person; determining a corresponding target avatar adjustment parameter based on the target response intention corresponding to the target specific person; and adjusting the avatar according to the target avatar adjustment parameter, and controlling the adjusted avatar to query the target specific person.
In some embodiments, the determining a target response intention corresponding to the target specific person based on the first combined feature comprises: inputting the first combined feature into a trained intention recognition model, and processing the first combined feature by the intention recognition model using model parameters corresponding to the first combined feature to obtain the target response intention corresponding to the target specific person.
After the first query sentence is output, the specific person voice and specific person image produced when the target specific person replies to the first query sentence are acquired, so as to determine the behavior features and speech features corresponding to the target specific person.
Drawings
FIG. 1 is a diagram of an exemplary application environment for a digital person-based specific person query method in one embodiment;
FIG. 2 is a schematic flow diagram of a digital person-based specific person query method in one embodiment;
FIG. 3 is a schematic flow diagram of a digital person-based specific person query method in one embodiment;
FIG. 4A is a flowchart illustrating the steps of determining a target query policy based on the target response intent corresponding to the target specific person and querying the target specific person based on the target query policy, in one embodiment;
FIG. 4B is a schematic diagram of an interface for a digital person to perform a query in some embodiments;
FIG. 5 is a block diagram of a digital person-based specific person query apparatus in one embodiment;
FIG. 6 is a block diagram that illustrates the structure of a target query policy determination module in one embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The digital person-based specific person query method provided by the present application can be applied in the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 is placed in the area where the target specific person is located; for example, when the target specific person is a criminal suspect, the terminal is placed in an interrogation room. The server 104 may output a first query sentence to the terminal 102, and the terminal 102 may output the first query sentence as voice or text. A camera and a recording device may be installed on the terminal 102, so that when the target specific person replies to the first query sentence, audio and images can be captured to obtain the specific person voice and specific person image of the reply, which are then sent to the server 104. The server 104 obtains a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; obtains a plurality of speech features corresponding to the target specific person based on the specific person voice to obtain a speech feature set; combines features in the behavior feature set with features in the speech feature set to obtain a first combined feature; and determines a target response intention corresponding to the target specific person based on the first combined feature. After obtaining the target response intention, the server 104 may send it to the terminal 102, or may determine the next query sentence based on it to continue questioning the target specific person.
The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device, and the server 104 may be implemented by an independent server or by a server cluster composed of multiple servers. It should be understood that the digital person-based specific person query method of the embodiments of the present application may also be executed on the terminal 102. The digital person in the embodiments of the present application is a virtual person that can assist or replace a real person in performing a task; for example, a developed set of programs may, when executed, assist or replace a real police officer in questioning a criminal suspect.
In one embodiment, as shown in fig. 2, there is provided a digital person-based specific person query method, which is illustrated by way of example as applied to the server in fig. 1 and which comprises the following steps:
In step S202, a first query sentence is output.
Here, a query sentence is a sentence used to question a specific person. The specific person is a user for whom the authenticity of the reply to the first query sentence needs to be determined; for example, a criminal suspect. The first query sentence may be randomly extracted from a query sentence library in which a plurality of candidate query sentences may be stored. The first query sentence may also be obtained according to attribute information of the target specific person. For example, the server may obtain a video image of the target specific person, perform face detection on the video image, obtain the identity of the target specific person by face recognition, retrieve at least one of the occupation, age, or native place of the target specific person from an attribute information database according to the identity, and select a matching question according to the attribute information with which to question the criminal suspect. For the same underlying question, the phrasing may differ for specific persons of different professions. For example, for a question about the time of the crime, if the criminal suspect's profession involves finance, the query sentence may be "Was the day of the incident a trading day or a non-trading day?"; if the suspect's profession involves law, the query sentence may be "Were you in court or at the office all day on the day of the incident?".
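As a rough illustration of this profession-matched phrasing, the following Python sketch selects a wording of the crime-time question based on the recognized profession. The template table, its keys, and the default question are assumptions; the patent gives only the finance and legal examples.

```python
# Hedged sketch of profession-matched question selection; the template table
# and default question are illustrative assumptions.
PROFESSION_TEMPLATES = {
    "finance": "Was the day of the incident a trading day or a non-trading day?",
    "legal": "Were you in court or at the office all day on the day of the incident?",
}

def select_first_query(profession: str) -> str:
    """Pick a phrasing of the crime-time question matched to the profession."""
    default = "Where were you on the day of the incident?"
    return PROFESSION_TEMPLATES.get(profession, default)

print(select_first_query("finance"))
```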
Specifically, the server may send the first query sentence to the terminal, and the terminal may display or play it. For example, an avatar such as a 3D (three-dimensional) figure may be displayed on the screen of the terminal, and when the terminal receives the first query sentence, the avatar is controlled to play the first query sentence as voice.
Step S204, the specific person voice and specific person image produced when the target specific person replies to the first query sentence are obtained.
The target specific person is the person to be questioned; the first query sentence is output for the target specific person, and it is the target specific person who is expected to reply to it. The specific person voice and specific person image are acquired in real time while the target specific person replies to the first query sentence. For example, when playback of the first query sentence finishes and the criminal suspect starts to answer, the voice and images from the end of playback until the target specific person finishes answering may be captured.
Specifically, the terminal may control the sensing device to obtain the voice information and the image, for example, a recording device may be used for recording, or a video shooting device may be used for shooting, so as to obtain the voice of the specific person and the image of the specific person corresponding to the target specific person answering the first query sentence. The terminal can send the voice of the specific person and the image of the specific person to the server in real time, and the server obtains the voice of the specific person and the image of the specific person.
Step S206, acquiring a plurality of behavior characteristics corresponding to the target specific person based on the specific person image to obtain a behavior characteristic set.
Here, a behavior feature is a feature used to represent a behavioral characteristic. It may be, for example, a mental-state feature, an expression feature, a gesture feature, a posture feature, or a facial-organ behavior feature; the facial-organ behavior features may include at least one of a behavior feature of the eyes or a behavior feature of the nose. The behavior feature of the eyes may be, for example, at least one of open or closed; the behavior feature of the nose may be, for example, at least one of inhaling or exhaling through the nose.
Behavior features may be recognized by an artificial intelligence model; for example, a model for recognizing behavior features may be trained in advance through supervised training. A training image and its label (the behavior feature) are obtained, the training image is input into the behavior feature recognition model to be trained, and a predicted behavior feature is output. A model loss value is computed from the difference between the predicted behavior feature and the label, and the model parameters are adjusted in the direction that decreases the loss until the model converges; the convergence condition may be that the model loss value is smaller than a preset threshold. The difference between the predicted behavior feature and the label is positively correlated with the model loss value: the larger the difference, the larger the loss.
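A minimal PyTorch sketch of this supervised training procedure follows. The backbone architecture, optimizer, and cross-entropy loss are assumptions; the patent states only that the loss grows with the gap between the predicted behavior feature and the label, that parameters move in the direction of decreasing loss, and that training may stop once the loss falls below a preset threshold.

```python
# Hedged sketch of the supervised training loop; architecture and loss are assumptions.
import torch
import torch.nn as nn

class BehaviorFeatureModel(nn.Module):
    def __init__(self, num_behaviors: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_behaviors),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # logits over behavior-feature classes

def train(model, loader, epochs: int = 10, loss_threshold: float = 0.05):
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()  # larger prediction/label gap -> larger loss
    for _ in range(epochs):
        for images, labels in loader:
            loss = loss_fn(model(images), labels)
            optimizer.zero_grad()
            loss.backward()      # adjust parameters toward decreasing loss
            optimizer.step()
            if loss.item() < loss_threshold:  # preset convergence threshold
                return model
    return model
```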
In some embodiments, the server may perform image recognition on the image of the specific person according to a computer vision technique to obtain at least one of micro expression or expression information of the criminal suspect so as to obtain a behavior feature corresponding to the target specific person, where the behavior feature set may include one or more behavior features. Plural means at least two. For example, a specific person image may be input into a behavior feature recognition model, which outputs behavior features.
In some embodiments, the behavior features may include expressions. Obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set includes: acquiring facial features corresponding to the target specific person based on the specific person image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target specific person.
Specifically, the expression is an emotional feeling expressed on the face, and may be, for example, startle, excitement, anger, or the like. The face features are features related to a face, and may be, for example, features corresponding to eyes, features corresponding to a mouth, and features corresponding to a nose. The human face features can be extracted by a human face feature extraction model, and the human face feature extraction model can be a deep learning model. A plurality of face feature extraction models may be included, and for example, at least one of a model extracting a feature corresponding to an eye or a model extracting a feature corresponding to a mouth may be included. The facial feature extraction model and the expression recognition model can be cascaded and obtained by performing combined training during model training. For example, the training image may be input into a facial feature extraction model to obtain facial features, and the facial features may be input into an expression recognition model to obtain predicted expressions. And obtaining a model loss value according to the difference between the predicted expression and the actual expression, and adjusting the parameters of the model according to a gradient descent method. And the difference between the predicted expression and the actual expression is in positive correlation with the model loss value. Therefore, the facial feature extraction model and the expression recognition model can be obtained through the combined training.
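The joint training of the cascaded models can be sketched as follows; the stand-in layers and the number of expression classes are assumptions. The point is that a single loss, computed from the gap between predicted and actual expression, backpropagates through both the expression recognition model and the face feature extraction model.

```python
# Hedged sketch of the cascaded, jointly trained pipeline; architectures are stand-ins.
import torch
import torch.nn as nn

face_extractor = nn.Sequential(      # stand-in face feature extraction model
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
expression_head = nn.Linear(16, 7)   # stand-in expression recognizer, 7 classes assumed

optimizer = torch.optim.Adam(
    list(face_extractor.parameters()) + list(expression_head.parameters())
)
loss_fn = nn.CrossEntropyLoss()

def joint_step(images: torch.Tensor, expression_labels: torch.Tensor) -> float:
    """One joint training step; gradients flow through both cascaded models."""
    features = face_extractor(images)            # face features
    logits = expression_head(features)           # predicted expression
    loss = loss_fn(logits, expression_labels)    # gap to the actual expression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```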
Step S208, a plurality of speech features corresponding to the target specific person are obtained based on the specific person voice, and a speech feature set is obtained.
Here, a speech feature is a feature used to represent a characteristic of the voice, and may include, for example, at least one of intonation or speech rate. Intonation refers to the rise and fall of the voice within a sentence; it may be, for example, rising, falling, or level. The speech feature set may include one or more speech features. Intonation features can be obtained by tracking changes in voice frequency.
Specifically, the server may perform speech feature recognition on the specific person voice using natural language processing techniques to obtain the speech feature set. For example, the server obtains voice attribute information corresponding to the target specific person based on the specific person voice, the voice attribute information including at least one of speech rate information or intonation change information. The intonation change information may be computed in units of a preset time length; for example, the average voice frequency for each time period of the preset length is calculated, and the intonation is determined from the change in average voice frequency between adjacent periods. For example, assuming the preset time length is 1 second, the average voice frequencies for the 1st, 2nd, and 3rd seconds may be acquired, and when the average voice frequency increases continuously, the intonation change information is rising.
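A minimal sketch of this per-second intonation statistic, assuming per-frame pitch (voice frequency) values have already been estimated upstream; the 1-second window follows the example above, and the trend labels are assumptions.

```python
# Sketch of intonation-change classification from per-second average frequency.
import numpy as np

def intonation_change(frame_pitch_hz: np.ndarray, frames_per_second: int) -> str:
    n = len(frame_pitch_hz) // frames_per_second
    if n < 2:
        return "level"  # not enough windows to compare
    windows = frame_pitch_hz[: n * frames_per_second].reshape(n, frames_per_second)
    means = windows.mean(axis=1)   # average voice frequency per second
    deltas = np.diff(means)        # change between adjacent time periods
    if np.all(deltas > 0):
        return "rising"            # continuously increasing average frequency
    if np.all(deltas < 0):
        return "falling"
    return "level"
```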
Step S210, combining the features in the behavior feature set and the features in the speech feature set to obtain a first combined feature.
Specifically, when the features in the behavior feature set and the features in the speech feature set are combined, all of the features may be combined together, or only some of them. There may be a plurality of first combined features. A first combined feature comprises at least one feature from the behavior feature set and at least one feature from the speech feature set, and a combined feature is processed as a whole.
Specifically, the server may obtain at least one behavior feature from the behavior feature set, obtain at least one speech feature from the speech feature set, and combine the obtained features to obtain the first combined feature.
In some embodiments, the specific person voice and specific person image span a certain length of time. When combining, the behavior features and speech features within the same time range can be combined, so as to represent the psychological state of the target specific person, such as a criminal suspect, at the same moment.
In some embodiments, the behavior features in a first time period may also be combined with the speech features in a second time period, where the first time period differs from the second and the two may be adjacent. Combining the speech and behavior features of adjacent time periods can reflect the psychological activity of a specific person such as a criminal suspect. For example, when a person lies, the speech rate may be faster, and some specific behavior is often performed shortly before or after lying, such as touching the nose; combining the speech features and behavior features of adjacent time periods can therefore further reveal the suspect's psychological activity and help determine whether the suspect is lying. A sketch of this pairing follows.
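The sketch below builds first combined features from both same-window and adjacent-window pairs. The timestamped feature representation, the feature names, and the 1-second adjacency gap are assumptions.

```python
# Hedged sketch of pairing behavior and speech features by time window.
from dataclasses import dataclass
from itertools import product

@dataclass
class TimedFeature:
    start: float  # seconds from the start of the reply
    end: float
    name: str     # e.g. "touch_nose" or "fast_speech" (illustrative names)

def first_combined_features(behaviors, speech, adjacency_s: float = 1.0):
    combos = []
    for b, s in product(behaviors, speech):
        same_window = (b.start, b.end) == (s.start, s.end)
        adjacent = (0.0 <= b.start - s.end <= adjacency_s or
                    0.0 <= s.start - b.end <= adjacency_s)
        if same_window or adjacent:
            combos.append((b.name, s.name))  # one combined feature, as a whole
    return combos
```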
In step S212, a target response intention corresponding to the target specific person is determined based on the first combined feature.
Here, the target response intention refers to the intention behind the reply and reflects how truthful the reply is; the target response intention may be lying or telling the truth. The target response intention corresponding to the first combined feature may be obtained according to a predetermined judgment rule. For example, a rule may state that when the speech rate is higher than a preset rate and there is nose-touching behavior in the adjacent time period, the target response intention is judged to be lying.
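That example judgment rule can be written out directly; the numeric speech-rate threshold and the feature name below are illustrative assumptions, since the patent states the rule pattern but no values.

```python
# Hedged sketch of the example judgment rule; threshold and names are assumptions.
def rule_based_intent(speech_rate: float, adjacent_behaviors: set,
                      rate_threshold: float = 4.0) -> str:
    if speech_rate > rate_threshold and "touch_nose" in adjacent_behaviors:
        return "lying"
    return "telling_truth"
```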
According to the digital person-based specific person query method described above, after the first query sentence is output, the specific person voice and specific person image produced when the target specific person replies to the first query sentence are acquired, so that the behavior features and speech features corresponding to the target specific person can be determined; by combining these features, the target response intention, and thus the authenticity of the reply, can be judged more accurately.
In some embodiments, the server may input the first combined feature into a trained intent recognition model, and the intent recognition model processes the first combined feature by using model parameters corresponding to the first combined feature to obtain a target response intent corresponding to the target specific person. The intent recognition model may be, for example, a neural network model. The intention recognition model may be obtained through supervised training, and model parameters corresponding to the first combined features are obtained through model training.
In some embodiments, when there are a plurality of first combined features, the response intention corresponding to each first combined feature may be determined based on its judgment rule, and the final target response intention determined by synthesizing the intentions across all first combined features. For example, the response intentions corresponding to the first combined features may be counted, and the most frequent intention taken as the final target response intention. For example, assume there are 5 first combined features: if 4 of them correspond to the response intention of lying and 1 corresponds to telling the truth, the final target response intention is lying. Determining the final response intention from the intentions of the individual combined features amounts to multi-level analysis and improves the accuracy of intention analysis.
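This synthesis reduces to a majority vote over the per-combination intentions; the sketch below reproduces the 4-versus-1 example from the text.

```python
# Majority vote over the intent derived from each first combined feature.
from collections import Counter

def final_intent(intents: list) -> str:
    return Counter(intents).most_common(1)[0][0]

assert final_intent(["lying"] * 4 + ["telling_truth"]) == "lying"
```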
In some embodiments, when determining the target response intention, the response intention corresponding to a forward query sentence may be obtained as the forward response intention, and the target response intention corresponding to the target specific person determined with reference to it. Here, a forward query sentence is a sentence used to question the target specific person before the first query sentence. For example, the number of lying intentions among the response intentions of the combined features of the forward query sentence (the forward lying count) may be obtained, and when the number of lying intentions among the target response intentions of the first combined features exceeds the forward lying count, the target specific person is judged to be lying. For example, if 3 combined features corresponded to lying in the previous round of intention recognition and 4 combined features correspond to lying this time, the target specific person is likely to be lying.
In some embodiments, during the inquiry, the digital person may acquire the specific person image and specific person voice once per preset time length, for example every 10 minutes, and perform an intention analysis; the speech and behavior features obtained may be used to generate an intention result corresponding to the target specific person, and the intention result may be displayed on a terminal used by the questioning police officers so that they can adjust the questioning approach.
In some embodiments, the intention recognition result obtained in each time period (an intermediate intention recognition result) may also be saved, and after the inquiry ends, a final intention analysis result is derived from the intermediate intention recognition results, i.e. a final summary of the criminal suspect's intentions. For example, the pattern of change of the intermediate intention recognition results may be obtained and output; the pattern might be, for example, mostly lying during the first 20 minutes and gradually telling the truth thereafter. Multi-level analysis using the intermediate intention recognition results makes the final result more accurate, and also helps the questioning police adjust their approach mid-way, inferring the suspect's psychological state so as to conduct the inquiry more effectively.
In some embodiments, the pattern of change of the response intention may be obtained. For example, the server may also output a probability for each response intention, determine the pattern of probability change, and output the change in the specific person's psychological state according to that pattern. For example, if the probability of telling the truth gradually increases, the psychological state is judged to be shifting toward cooperation; if the probability of lying gradually increases, the psychological state is judged to be shifting toward resisting the inquiry.
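A sketch of mapping the probability change pattern to a psychological-state trend, assuming the server exposes a per-round probability of telling the truth; the trend labels follow the two examples above.

```python
# Hedged sketch: probability trend -> psychological-state change.
def mental_state_trend(truth_probabilities: list) -> str:
    deltas = [b - a for a, b in zip(truth_probabilities, truth_probabilities[1:])]
    if deltas and all(d > 0 for d in deltas):
        return "increasingly inclined to cooperate"
    if deltas and all(d < 0 for d in deltas):
        return "increasingly inclined to resist the inquiry"
    return "no clear trend"
```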
In some embodiments, as shown in FIG. 3, the digital person-based person-specific query method may further include the steps of:
step S302, semantic analysis is carried out on the voice of the specific person, and target semantics corresponding to the voice of the specific person are obtained.
Semantics refers to the meaning expressed by the reply sentence. For example, for a criminal suspect's reply, the target semantics may be an admission or a denial of the facts of the crime. The target semantics can be obtained through recognition by a semantic recognition model, which is an artificial intelligence model.
Step S304, combining the target semantics with the intonation change information in the voice attribute information to obtain a second combination characteristic.
The step S212 of determining the target response intention corresponding to the target specific person based on the first combined feature includes: and determining the target response intention corresponding to the target specific person based on the first combined characteristic and the second combined characteristic.
Specifically, since different semantics correspond to different intonation changes, combining the target semantics with the intonation change information to obtain the second combined feature makes it possible to mine the psychological activity expressed through language when the target specific person states his or her meaning. Determining the target response intention based on the second combined feature as well, for example by inputting both the first combined feature and the second combined feature into the intention recognition model, can therefore improve the accuracy of the obtained target response intention.
In some embodiments, the digital person-based specific person query method may further comprise the steps of: determining a target query policy based on the target response intention corresponding to the target specific person, and querying the target specific person based on the target query policy.
Here, the query policy refers to the strategy adopted in questioning. The policy may include at least one of the intonation of the query, the manner of questioning, the image of the questioner, or the type of question asked. The correspondence between response intentions and query policies can be preset, so that the corresponding target query policy can be obtained according to the target response intention and the target specific person questioned according to it. Adopting different query policies for different response intentions makes the inquiry more efficient.
For example, if the response intention is lying, the intonation and severity of the inquiry can be increased to put the criminal suspect under pressure, and a query sentence semantically similar to the first query sentence can be obtained to question the target specific person, so that the target specific person answers more targeted questions and the inquiry becomes more effective.
For another example, assume the response intention is telling the truth. Then query sentences semantically similar to the first query sentence may be skipped, and other questions obtained for the inquiry.
For another example, if the response intention is lying, the questioning mode may be changed so that a real person, such as a police officer, asks the questions, making the questioning more effective. Taken together, these examples amount to a preset correspondence from response intention to query policy, as sketched below.
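The sketch encodes the examples above as a preset lookup table from response intention to query policy; the field names and values are assumptions, since the patent describes the policy dimensions but not a concrete encoding.

```python
# Hedged sketch of a preset response-intention -> query-policy table.
QUERY_POLICY = {
    "lying": {
        "intonation": "stern",                 # raise severity to apply pressure
        "next_question": "similar_semantics",  # re-probe the same point
        "questioner": "human_officer",         # optionally hand over to a person
    },
    "telling_truth": {
        "intonation": "mild",
        "next_question": "next_in_sequence",   # skip semantically similar items
        "questioner": "digital_person",
    },
}

def target_query_policy(target_intent: str) -> dict:
    return QUERY_POLICY[target_intent]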
In some embodiments, determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy comprises: acquiring an avatar corresponding to the digital person; determining a corresponding target avatar adjustment parameter based on the target response intention corresponding to the target specific person; and adjusting the avatar according to the target avatar adjustment parameter, and controlling the adjusted avatar to query the target specific person.
Specifically, an avatar is an image that is obtained virtually rather than being a real user's image; it may be, for example, a cartoon figure. An avatar adjustment parameter is a parameter for adjusting the avatar, and includes, for example, at least one of an adjustment parameter for the face or an adjustment parameter for the gesture. Different response intentions may correspond to different avatar adjustment parameters, and the correspondence between response intentions and adjustment parameters is preset. For example, when the target response intention is telling the truth, the adjustment parameter is a mild one, used to adjust the avatar to a mild state; when the target response intention is lying, the adjustment parameter is a stern one, used to adjust the avatar to a stern state. As a practical example, the mild adjustment parameters may include a parameter that adjusts the face to a smiling expression, while the stern adjustment parameters may include a parameter that adjusts the face to a stern expression and a parameter that adjusts the gesture to a table-slapping gesture. After the avatar has been adjusted, the adjusted avatar can be controlled to query the target specific person; a sketch of this adjustment follows.
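The following sketch encodes the mild/stern example above as a lookup over assumed parameter names; the avatar representation is likewise an assumption.

```python
# Preset correspondence between response intention and avatar adjustment
# parameters; the parameter names and avatar fields are illustrative.
AVATAR_ADJUSTMENTS = {
    "truthful": {"face": "smiling", "gesture": None},        # mild state
    "lying":    {"face": "stern", "gesture": "table_slap"},  # stern state
}

def adjust_avatar(avatar, target_response_intent):
    # Apply the preset face/gesture adjustments to the avatar.
    params = AVATAR_ADJUSTMENTS[target_response_intent]
    avatar["expression"] = params["face"]
    if params["gesture"] is not None:
        avatar["gesture"] = params["gesture"]
    return avatar

avatar = {"type": "cartoon_police_officer", "expression": "neutral", "gesture": None}
print(adjust_avatar(avatar, "lying"))
```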
In some embodiments, as shown in fig. 4A, determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy includes the following steps:
In step S402, a second query sentence for querying the target specific person is acquired.
Specifically, the second query sentence may be selected randomly or determined according to the target response intention. For example, if the target response intention is lying, a sentence whose semantics are similar to those of the first query sentence is acquired as the second query sentence, so that the facts can be further verified through semantically similar questioning. If the target response intention is telling the truth, the second query sentence can be obtained according to the preset question sequence to continue questioning the criminal suspect; this selection logic is sketched below.
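As an assumed sketch of that branch: a lying intent re-asks the most semantically similar sentence from a question bank, while a truthful intent advances through a preset sequence. The similarity function is a crude word-overlap stand-in for whatever sentence-similarity model a real system would use.

```python
def similarity(a, b):
    # Word-set overlap (Jaccard); a real system would use sentence embeddings.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def select_second_query(first_query, intent, question_bank, preset_sequence, asked):
    if intent == "lying":
        # Re-ask the bank sentence closest in meaning to the first query.
        return max(question_bank, key=lambda q: similarity(q, first_query))
    # Truthful: advance to the next unasked question in the preset order.
    return next(q for q in preset_sequence if q not in asked)

bank = ["Where exactly were you that night?", "Who were you with that night?"]
seq = ["Where were you that night?", "Who can confirm this?"]
print(select_second_query("Where were you that night?", "lying", bank, seq, set()))
```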
In some embodiments, obtaining the second query statement for querying the target specific person comprises: performing semantic analysis on the voice of the specific person to obtain target semantics corresponding to the voice of the specific person; and acquiring an inquiry statement corresponding to the target semantic from the inquiry statement library as a second inquiry statement.
Specifically, the server can recognize the meaning expressed by the specific person's voice using a semantic recognition model to obtain the target semantics, for example whether the crime is admitted or denied. The query sentence library stores the query sentences corresponding to each semantic, that is, the correspondence between semantics and query sentences, so that the corresponding query sentence can be obtained according to the target semantics and used as the second query sentence. By retrieving the corresponding target query sentence from the library according to the meaning expressed when the target specific person, such as a criminal suspect, replies, the digital person can flexibly adapt its questioning to the reply, improving the inquiry effect.
For example, if the target semantics indicate that the criminal suspect claims to have been eating at a restaurant at the time, query sentences related to eating at that restaurant may be acquired, for example a sentence asking for information about the restaurant or for details of the meal, such as a sentence asking the restaurant's name or the dishes ordered there; a toy version of such a library lookup follows.
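In the sketch below the library is a plain dictionary keyed by a recognized semantic label; the labels, sentences, and fallback are all assumptions.

```python
# Query sentence library: preset correspondence between recognized semantics
# and follow-up query sentences. Entries are illustrative only.
QUERY_LIBRARY = {
    "denies_crime": ["Where were you at the time of the incident?"],
    "was_at_restaurant": [
        "What is the name of the restaurant?",
        "Which dishes did you order there?",
    ],
}

def second_query_from_semantics(target_semantics):
    # Fall back to a generic probe when no entry matches.
    sentences = QUERY_LIBRARY.get(target_semantics,
                                  ["Please describe that in more detail."])
    return sentences[0]

print(second_query_from_semantics("was_at_restaurant"))
```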
In some embodiments, a target entity in the target semantics may be obtained, and a query statement corresponding to the target entity may be acquired as the second query statement. For example, associated entities linked to the target entity may be retrieved from a knowledge graph, and the corresponding second query statement generated from them. Suppose the target semantics are "I passed through xx park that day", and that in the knowledge graph the entities associated with "xx park" include "entrance A" and "entrance B", both of which have the attribute "entrance"; a question related to "entrance" can then be obtained, for example "Did you enter the park from entrance A or entrance B?". A toy version of this lookup is sketched below.
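The sketch assumes a tiny adjacency-list knowledge graph and a hard-coded question template; none of these structures come from the disclosure itself.

```python
# Toy knowledge graph: entities associated with a mentioned entity, each
# carrying an attribute. Contents match the park example above.
KNOWLEDGE_GRAPH = {
    "xx park": [
        {"name": "entrance A", "attribute": "entrance"},
        {"name": "entrance B", "attribute": "entrance"},
    ],
}

def question_from_entity(target_entity):
    related = KNOWLEDGE_GRAPH.get(target_entity, [])
    entrances = [e["name"] for e in related if e["attribute"] == "entrance"]
    if len(entrances) >= 2:
        return f"Did you enter the park from {entrances[0]} or {entrances[1]}?"
    return None

print(question_from_entity("xx park"))
```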
Step S404, determining the corresponding target query intonation according to the target reply intention.
Specifically, the correspondence between response intentions and query intonations is preset, so that after the server obtains the target response intention, it can look up the corresponding target query intonation. For example, it may be set that when the response intention is lying, the retrieved query intonation is a rising intonation, and when the response intention is telling the truth, a normal, level intonation is used.
Step S406, obtaining a target query voice according to the second query statement and the target query intonation.
Specifically, after the target query intonation is obtained, the second query statement may be read aloud with the target query intonation using a speech synthesis technique, thereby obtaining the target query voice; a placeholder for this step is sketched below.
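The synthesize() call below is purely hypothetical; a real implementation would invoke a text-to-speech engine that accepts prosody parameters, which the application does not name.

```python
def synthesize(text, intonation):
    # Hypothetical TTS stand-in: returns a descriptor instead of audio samples.
    return {"text": text, "intonation": intonation, "format": "wav"}

def target_query_voice(second_query, target_intonation):
    # Step S406: read the second query statement with the target intonation.
    return synthesize(second_query, target_intonation)

print(target_query_voice("Which dishes did you order there?", "rising"))
```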
In some embodiments, obtaining the target query voice according to the second query statement and the target query intonation includes: acquiring background attribute information corresponding to the target specific person; modifying the second query statement according to the background attribute information to obtain a modified second query statement; and obtaining the target query voice according to the modified second query statement and the target query intonation.
The background attribute information is attribute information indicating the user's background, such as age, occupation, or hobbies. After the background attribute information is obtained, the second query statement may be modified based on it so that the modified statement better matches the background of the target specific person, thereby increasing the effectiveness of the inquiry.
In some embodiments, the nouns in the second query statement may be modified based on the background attribute information; for example, for criminal suspects of different professions, the nouns in the second query sentence may be adapted to the profession. A sentence generation model for generating one sentence from another may be acquired, the second query sentence and the background attribute information may be input into it, and the model rewrites the second query sentence according to the background attribute information to generate a semantically similar query sentence as the modified second query sentence. By modifying the second query sentence according to the background attribute information, a modified second query sentence matching that background is obtained, so that the target specific person can better understand the question. A simple substitution-based stand-in for this step is sketched below.
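The disclosure suggests a trained sentence-generation model for this rewriting; the sketch below substitutes a trivial vocabulary lookup so the step stays self-contained. The professions and substitutions are invented for illustration.

```python
# Profession-specific noun substitutions standing in for a trained
# sentence-generation model. All entries are invented examples.
PROFESSION_VOCAB = {
    "taxi driver": {"the vehicle": "your taxi"},
    "courier":     {"the vehicle": "your delivery van"},
}

def modify_query(second_query, background):
    # A real system would feed (sentence, background) into the generation
    # model; here generic nouns are simply replaced with specific ones.
    substitutions = PROFESSION_VOCAB.get(background.get("profession"), {})
    for generic, specific in substitutions.items():
        second_query = second_query.replace(generic, specific)
    return second_query

print(modify_query("Where did you park the vehicle that day?",
                   {"profession": "taxi driver", "age": 35}))
```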
In step S408, the target inquiry voice is output.
Specifically, the server may send the target query voice to the terminal, and the terminal may play the target query voice through the voice playing device.
The method provided by the embodiments of the application can be applied to the scenario of questioning a criminal suspect. Conventionally, a criminal suspect is questioned face to face by case-handling police officers, who rely on knowledge and experience gained through professional training to obtain true and complete information. However, highly skilled police resources are scarce, and trained criminal suspects may deceive or conceal facts, achieving illegal goals by exploiting the weaknesses of real officers. A virtual digital police officer can therefore be created to question the criminal suspect. The virtual digital officer can collect the suspect's video and sound information through video and voice acquisition devices for subsequent response-intention recognition, display query sentences through video and voice playback devices to interact with the suspect, and present pictures and video evidence through the video devices.
In addition, massive data such as a complete knowledge graph and a crime database can serve as the basis on which the virtual digital police officer analyzes response intentions when questioning the criminal suspect. For example, the server can assess the truthfulness of the suspect's statements using the comprehensive psychology and criminal-psychology databases with which the digital person is equipped, and can adjust the questioning strategy according to the obtained target response intention, such as lying. Because the digital person is neutral, it cannot be misled by a suspect's rhetorical tricks, so true and effective information can be obtained; and because the digital person is highly reproducible, it can be deployed to police departments everywhere, giving each of them a digital person with the same capabilities.
The digital person in the embodiments of the application can be represented by an avatar, and the avatar can be adjusted according to the target response intention corresponding to the target specific person; for example, when the response intention is lying, the emotional state of the avatar is adjusted to stern. Fig. 4B is a schematic diagram of an interface through which the digital person conducts an inquiry in some embodiments: a virtual police officer is displayed on the suspect inquiry interface; when the target specific person is not detected to be lying, the virtual officer remains in a normal state, and when lying is detected, the officer is adjusted to a stern state. The terminal can control the virtual officer to utter a spoken query sentence, for example, "Which entrance did you enter the park from that day?".
It should be understood that, although the steps in the above flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps may comprise multiple sub-steps or stages, which need not be completed at the same time and may be performed at different times; their execution order likewise need not be sequential, and they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a digital person-based specific person query device, comprising: a first query sentence output module 502, an information acquisition module 504, a behavior feature set obtaining module 506, a speech feature set obtaining module 508, a first combining module 510, and a target reply intention determining module 512, wherein:
a first query statement output module 502, configured to output a first query statement.
The information obtaining module 504 is configured to obtain a specific person voice and a specific person image corresponding to the target specific person answering the first query sentence.
A behavior feature set obtaining module 506, configured to obtain a plurality of behavior features corresponding to the target specific person based on the specific person image, so as to obtain a behavior feature set.
A voice feature set obtaining module 508, configured to obtain, based on the voice of the specific person, a plurality of voice features corresponding to the target specific person, so as to obtain a voice feature set.
And a first combining module 510, configured to combine features in the behavior feature set with features in the speech feature set to obtain a first combined feature.
And a target response intention determining module 512, configured to determine a target response intention corresponding to the target specific person based on the first combined feature.
In some embodiments, the voice feature set obtaining module is configured to acquire voice attribute information corresponding to the target specific person based on the voice of the specific person, wherein the voice attribute information comprises at least one of voice speed information or voice tone change information.
In some embodiments, the voice feature set obtaining module is configured to perform semantic analysis on the voice of the specific person to obtain the target semantics corresponding to that voice. The digital person-based specific person query device further comprises a second combining module for combining the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature, and the target response intention determining module is configured to determine the target response intention corresponding to the target specific person based on the first combined feature and the second combined feature.
In some embodiments, the behavior feature set obtaining module is configured to acquire the facial features corresponding to the target specific person based on the specific person image, and to process the facial features with the trained expression recognition model to obtain the target expression corresponding to the target specific person.
In some embodiments, the digital person-based specific person query device further comprises a target query policy determination module for determining a target query policy based on the target response intention corresponding to the target specific person, and querying the target specific person based on the target query policy.
In some embodiments, as shown in fig. 6, the target query policy determination module includes:
a second query sentence acquisition unit 602, configured to acquire a second query sentence for querying the target specific person.
And a target query intonation determining unit 604, configured to determine a corresponding target query intonation according to the target response intention.
A target query speech obtaining unit 606, configured to obtain a target query speech according to the second query statement and the target query intonation.
A target query voice output unit 608 for outputting the target query voice.
In some embodiments, the second query statement acquisition unit is configured to: performing semantic analysis on the voice of the specific person to obtain target semantics corresponding to the voice of the specific person; and acquiring an inquiry statement corresponding to the target semantic from the inquiry statement library as a second inquiry statement.
In some embodiments, the target query speech obtaining unit is configured to: acquiring background attribute information corresponding to a target specific person; modifying the second query statement according to the background attribute information to obtain a modified second query statement; and obtaining the target inquiry voice according to the modified second inquiry statement and the target inquiry tone.
In some embodiments, the target query policy determining module is configured to: acquire an avatar corresponding to the digital person; determine corresponding target image adjustment parameters based on the target response intention corresponding to the target specific person; and adjust the avatar according to the target image adjustment parameters and control the adjusted avatar to query the target specific person.
In some embodiments, the target response intention determining module is configured to input the first combined feature into a trained intention recognition model, which processes the first combined feature using the corresponding model parameters to obtain the target response intention corresponding to the target specific person.
For specific limitations of the digital person-based specific person query device, reference may be made to the above limitations of the digital person-based specific person query method, which are not repeated here. The respective modules in the above device may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke them and perform the corresponding operations.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data for the digital person-based specific person query method. The network interface of the computer device communicates with external terminals through a network connection. The computer program, when executed by a processor, implements a digital person-based specific person query method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: outputting a first query statement; acquiring a specific person voice and a specific person image corresponding to the target specific person's reply to the first query statement; acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; acquiring a plurality of voice features corresponding to the target specific person based on the voice of the specific person to obtain a voice feature set; combining the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature; and determining a target response intention corresponding to the target specific person based on the first combined feature.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, performs the following steps: outputting a first query statement; acquiring a specific person voice and a specific person image corresponding to the target specific person's reply to the first query statement; acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; acquiring a plurality of voice features corresponding to the target specific person based on the voice of the specific person to obtain a voice feature set; combining the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature; and determining a target response intention corresponding to the target specific person based on the first combined feature.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of technical features, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (16)

1. A digital person-based specific person query method, the method comprising:
outputting a first query statement;
acquiring a specific person voice and a specific person image corresponding to the target specific person's reply to the first query statement;
acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set;
acquiring a plurality of voice features corresponding to the target specific person based on the voice of the specific person to obtain a voice feature set;
combining the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature;
determining a target response intention corresponding to the target specific person based on the first combined feature.
2. The method according to claim 1, wherein the acquiring a plurality of voice features corresponding to the target specific person based on the voice of the specific person to obtain a voice feature set comprises:
acquiring voice attribute information corresponding to the target specific person based on the voice of the specific person, wherein the voice attribute information comprises at least one of voice speed information or voice tone change information.
3. The method according to claim 2, wherein the acquiring a plurality of voice features corresponding to the target specific person based on the voice of the specific person to obtain a voice feature set comprises:
performing semantic analysis on the voice of the specific person to obtain target semantics corresponding to the voice of the specific person;
the method further comprises the following steps:
combining the target semantics with intonation change information in the voice attribute information to obtain a second combined feature;
the determining a target response intention corresponding to the target specific person based on the first combined feature comprises:
determining a target response intention corresponding to the target specific person based on the first combined feature and the second combined feature.
4. The method according to claim 1, wherein the acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set comprises:
acquiring the facial features corresponding to the target specific person based on the specific person image, and processing the facial features by using the trained expression recognition model to obtain the target expression corresponding to the target specific person.
5. The method of claim 1, further comprising:
determining a target query policy based on the target response intention corresponding to the target specific person, and querying the target specific person based on the target query policy.
6. The method according to claim 5, wherein the determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy comprises:
acquiring a second inquiry statement for inquiring the target specific person;
determining a corresponding target query intonation according to the target response intention;
obtaining a target query voice according to the second query statement and the target query intonation;
and outputting the target query voice.
7. The method of claim 6, wherein the obtaining a second query statement that queries the target specific person comprises:
performing semantic analysis on the voice of the specific person to obtain target semantics corresponding to the voice of the specific person;
and acquiring an inquiry statement corresponding to the target semantic from the inquiry statement library as a second inquiry statement.
8. The method according to claim 6, wherein the obtaining a target query voice according to the second query statement and the target query intonation comprises:
acquiring background attribute information corresponding to the target specific person;
modifying the second query statement according to the background attribute information to obtain a modified second query statement;
and obtaining the target query voice according to the modified second query statement and the target query intonation.
9. The method according to claim 5, wherein the determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy comprises:
acquiring an avatar corresponding to the digital person;
determining a corresponding target image adjusting parameter based on the target response intention corresponding to the target specific person;
and performing image adjustment on the avatar according to the target image adjustment parameters, and controlling the adjusted avatar to query the target specific person.
10. The method according to claim 1, wherein the determining a target response intention corresponding to the target specific person based on the first combined feature comprises:
inputting the first combined feature into a trained intention recognition model, and processing the first combined feature by the intention recognition model using a model parameter corresponding to the first combined feature to obtain a target response intention corresponding to the target specific person.
11. A digital person-based specific person query apparatus, the apparatus comprising:
a first query statement output module for outputting a first query statement;
an information acquisition module for acquiring the specific person voice and the specific person image corresponding to the target specific person's reply to the first query statement;
a behavior feature set obtaining module, configured to obtain, based on the specific person image, a plurality of behavior features corresponding to the target specific person to obtain a behavior feature set;
a voice feature set obtaining module, configured to obtain, based on the voice of the specific person, a plurality of voice features corresponding to the target specific person, so as to obtain a voice feature set;
a first combination module for combining the features in the behavior feature set and the features in the voice feature set to obtain a first combined feature;
and a target response intention determining module for determining the target response intention corresponding to the target specific person based on the first combined feature.
12. The apparatus of claim 11, wherein the voice feature set obtaining module is configured to:
acquiring voice attribute information corresponding to the target specific person based on the voice of the specific person, wherein the voice attribute information comprises at least one of voice speed information or voice tone change information.
13. The apparatus of claim 12, wherein the voice feature set obtaining module is configured to:
performing semantic analysis on the voice of the specific person to obtain target semantics corresponding to the voice of the specific person;
the device further comprises:
a second combination module for combining the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature;
the target response intention determining module is configured to:
determining a target response intention corresponding to the target specific person based on the first combined feature and the second combined feature.
14. The apparatus of claim 11, further comprising:
a target query policy determining module for determining a target query policy based on the target response intention corresponding to the target specific person and querying the target specific person based on the target query policy.
15. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when executing the computer program.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 10.
CN202010847705.2A 2020-08-21 2020-08-21 Method, device and storage medium for querying specific person based on digital person Active CN112151027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010847705.2A CN112151027B (en) 2020-08-21 2020-08-21 Method, device and storage medium for querying specific person based on digital person

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010847705.2A CN112151027B (en) 2020-08-21 2020-08-21 Method, device and storage medium for querying specific person based on digital person

Publications (2)

Publication Number Publication Date
CN112151027A true CN112151027A (en) 2020-12-29
CN112151027B CN112151027B (en) 2024-05-03

Family

ID=73888942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010847705.2A Active CN112151027B (en) 2020-08-21 2020-08-21 Method, device and storage medium for querying specific person based on digital person

Country Status (1)

Country Link
CN (1) CN112151027B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180365257A1 (en) * 2017-06-19 2018-12-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatu for querying
CN107705357A (en) * 2017-09-11 2018-02-16 广东欧珀移动通信有限公司 Lie detecting method and device
CN109829358A (en) * 2018-12-14 2019-05-31 深圳壹账通智能科技有限公司 Micro- expression loan control method, device, computer equipment and storage medium
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN110427472A (en) * 2019-08-02 2019-11-08 深圳追一科技有限公司 The matched method, apparatus of intelligent customer service, terminal device and storage medium
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN110689889A (en) * 2019-10-11 2020-01-14 深圳追一科技有限公司 Man-machine interaction method and device, electronic equipment and storage medium
CN110969106A (en) * 2019-11-25 2020-04-07 东南大学 Multi-mode lie detection method based on expression, voice and eye movement characteristics
CN111429267A (en) * 2020-03-26 2020-07-17 深圳壹账通智能科技有限公司 Face examination risk control method and device, computer equipment and storage medium
CN111523981A (en) * 2020-04-29 2020-08-11 深圳追一科技有限公司 Virtual trial method and device, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409822A (en) * 2021-05-31 2021-09-17 青岛海尔科技有限公司 Object state determination method and device, storage medium and electronic device
CN113724705A (en) * 2021-08-31 2021-11-30 平安普惠企业管理有限公司 Voice response method, device, equipment and storage medium
CN113724705B (en) * 2021-08-31 2023-07-25 平安普惠企业管理有限公司 Voice response method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112151027B (en) 2024-05-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant