CN112151027B - Method, device and storage medium for querying specific person based on digital person - Google Patents

Method, device and storage medium for querying specific person based on digital person

Info

Publication number
CN112151027B
CN112151027B (application CN202010847705.2A)
Authority
CN
China
Prior art keywords
target
specific person
voice
query
person
Prior art date
Legal status
Active
Application number
CN202010847705.2A
Other languages
Chinese (zh)
Other versions
CN112151027A (en)
Inventor
常向月
Current Assignee
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202010847705.2A
Publication of CN112151027A
Application granted
Publication of CN112151027B

Classifications

    • G10L15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F40/30: Handling natural language data; semantic analysis
    • G06V40/166: Human faces; detection, localisation, normalisation using acquisition arrangements
    • G06V40/168: Human faces; feature extraction, face representation
    • G06V40/174: Human faces; facial expression recognition
    • G10L15/1822: Speech classification or search using natural language modelling; parsing for meaning understanding
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/90: Pitch determination of speech signals
    • G10L2025/906: Pitch tracking

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to a digital person-based method, device and storage medium for querying a specific person. The method comprises the following steps: acquiring the specific person voice and the specific person image produced when a target specific person replies to a first query sentence; acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set; combining the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature; and determining the target reply intention corresponding to the target specific person based on the first combined feature. The digital person of the present application can adjust its avatar according to the target reply intention corresponding to the target specific person; for example, when the reply intention is lying, the avatar is adjusted to appear stern. The method can improve the accuracy with which the digital person judges the authenticity of a specific person's replies to query sentences.

Description

Method, device and storage medium for querying specific person based on digital person
Technical Field
The present application relates to the field of man-machine interaction technologies, and in particular, to a specific person inquiry method, apparatus and storage medium based on digital persons.
Background
With the development of science and technology, human-computer interaction technology can be used to complete specific work in many scenarios; for example, digital persons can provide services such as question answering and information query for users.
At present, after receiving a sentence input by a user, a digital person can understand the sentence to determine the meaning it contains. However, the meaning expressed by a sentence may not accord with the actual situation; that is, the digital person cannot determine the authenticity of the meaning expressed by the sentence, so the accuracy of authenticity judgment is low.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a digital person-based specific person query method, device and storage medium.
A digital person-based specific person query method, the method comprising: outputting a first query sentence; acquiring the specific person voice and the specific person image corresponding to the target specific person's reply to the first query sentence; acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set; combining the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature; and determining the target reply intention corresponding to the target specific person based on the first combined feature.
In some embodiments, the acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set includes: acquiring voice attribute information corresponding to the target specific person based on the specific person voice, wherein the voice attribute information includes at least one of speech speed information and intonation change information.
In some embodiments, the acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set includes: carrying out semantic analysis on the specific person voice to obtain the target semantics corresponding to the specific person voice. The method further comprises the step of: combining the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature. The determining, based on the first combined feature, the target reply intention corresponding to the target specific person includes: determining the target reply intention corresponding to the target specific person based on the first combined feature and the second combined feature.
In some embodiments, the acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set includes: acquiring the face features corresponding to the target specific person based on the specific person image, and processing the face features with a trained expression recognition model to obtain the target expression corresponding to the target specific person.
In some embodiments, the method further comprises: and determining a target inquiry strategy based on the target reply intention corresponding to the target specific person, and inquiring the target specific person based on the target inquiry strategy.
In some embodiments, the determining a target query policy based on the target reply intent corresponding to the target specific person, and querying the target specific person based on the target query policy includes: acquiring a second inquiry sentence for inquiring the target specific person; determining a corresponding target inquiry intonation according to the target reply intention; obtaining target inquiry voice according to the second inquiry sentence and the target inquiry intonation; and outputting the target inquiry voice.
In some embodiments, the obtaining a second query statement that queries the target specific person comprises: carrying out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquiring the query sentence corresponding to the target semantic from the query sentence library as a second query sentence.
In some embodiments, the obtaining the target query speech according to the second query sentence and the target query intonation includes: obtaining background attribute information corresponding to the target specific person; modifying the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtaining the target query voice according to the modified second query sentence and the target query intonation.
In some embodiments, the determining a target query policy based on the target reply intention corresponding to the target specific person, and querying the target specific person based on the target query policy, includes: acquiring a virtual image corresponding to the digital person; determining corresponding target image adjustment parameters based on the target reply intention corresponding to the target specific person; and carrying out image adjustment on the virtual image according to the target image adjustment parameters, and controlling the image-adjusted virtual image to query the target specific person.
In some embodiments, the determining, based on the first combined feature, the target reply intention corresponding to the target specific person includes: inputting the first combined feature into a trained intention recognition model, the intention recognition model processing the first combined feature using model parameters corresponding to the first combined feature to obtain the target reply intention corresponding to the target specific person.
A digital person-based person-specific interrogation device, the device comprising: the first inquiry sentence output module is used for outputting a first inquiry sentence; the information acquisition module is used for acquiring specific person voice and specific person images corresponding to the target specific person replying to the first inquiry statement; the behavior feature set obtaining module is used for obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; the voice feature set obtaining module is used for obtaining a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set; the first combination module is used for combining the characteristics in the behavior characteristic set and the characteristics in the voice characteristic set to obtain first combination characteristics; and the target reply intention determining module is used for determining the target reply intention corresponding to the target specific person based on the first combined characteristic.
In some embodiments, the speech feature set deriving module is configured to: and acquiring voice attribute information corresponding to the target specific person based on the specific person voice, wherein the voice attribute information comprises at least one of speech speed information and intonation change information.
In some embodiments, the speech feature set deriving module is configured to: carrying out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; the apparatus further comprises a second combining module for: combining the target semantics with intonation change information in the voice attribute information to obtain a second combined feature; the target reply intention determining module is used for: and determining the target reply intention corresponding to the target specific person based on the first combination feature and the second combination feature.
In some embodiments, the behavior feature set derivation module is to: acquire the face features corresponding to the target specific person based on the specific person image, and process the face features with a trained expression recognition model to obtain the target expression corresponding to the target specific person.
In some embodiments, the apparatus further comprises a target interrogation policy determination module to: and determining a target inquiry strategy based on the target reply intention corresponding to the target specific person, and inquiring the target specific person based on the target inquiry strategy.
In some embodiments, the target interrogation policy determination module comprises: a second inquiry sentence acquisition unit configured to acquire a second inquiry sentence for inquiring about the target specific person; the target inquiry intonation determining unit is used for determining a corresponding target inquiry intonation according to the target reply intention; the target inquiry voice obtaining unit is used for obtaining target inquiry voice according to the second inquiry statement and the target inquiry intonation; and the target inquiry voice output unit is used for outputting the target inquiry voice.
In some embodiments, the second query statement obtaining unit is configured to: carrying out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquiring the query sentence corresponding to the target semantic from the query sentence library as a second query sentence.
In some embodiments, the target inquiry voice obtaining unit is configured to obtain background attribute information corresponding to the target specific person; modifying the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtaining target query voice according to the modified second query sentence and the target query intonation.
In some embodiments, the target interrogation policy determination module is to: acquire a virtual image corresponding to the digital person; determine corresponding target image adjustment parameters based on the target reply intention corresponding to the target specific person; and carry out image adjustment on the virtual image according to the target image adjustment parameters, and control the image-adjusted virtual image to query the target specific person.
In some embodiments, the target reply intent determination module is to: input the first combined feature into a trained intention recognition model, the intention recognition model processing the first combined feature using model parameters corresponding to the first combined feature to obtain the target reply intention corresponding to the target specific person.
A computer device comprising a memory storing a computer program and a processor which, when executing the computer program, performs the steps of: outputting a first query sentence; acquiring the specific person voice and the specific person image corresponding to the target specific person's reply to the first query sentence; acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set; combining the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature; and determining the target reply intention corresponding to the target specific person based on the first combined feature.
In some embodiments, the acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set includes: acquiring voice attribute information corresponding to the target specific person based on the specific person voice, wherein the voice attribute information includes at least one of speech speed information and intonation change information.
In some embodiments, the acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set includes: carrying out semantic analysis on the specific person voice to obtain the target semantics corresponding to the specific person voice. The computer program further causes the processor to perform the step of: combining the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature. The determining, based on the first combined feature, the target reply intention corresponding to the target specific person includes: determining the target reply intention corresponding to the target specific person based on the first combined feature and the second combined feature.
In some embodiments, the acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set includes: acquiring the face features corresponding to the target specific person based on the specific person image, and processing the face features with a trained expression recognition model to obtain the target expression corresponding to the target specific person.
In some embodiments, the computer program further causes the processor to perform the steps of: and determining a target inquiry strategy based on the target reply intention corresponding to the target specific person, and inquiring the target specific person based on the target inquiry strategy.
In some embodiments, the determining a target query policy based on the target reply intent corresponding to the target specific person, and querying the target specific person based on the target query policy includes: acquiring a second inquiry sentence for inquiring the target specific person; determining a corresponding target inquiry intonation according to the target reply intention; obtaining target inquiry voice according to the second inquiry sentence and the target inquiry intonation; and outputting the target inquiry voice.
In some embodiments, the obtaining a second query statement that queries the target specific person comprises: carrying out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquiring the query sentence corresponding to the target semantic from the query sentence library as a second query sentence.
In some embodiments, the obtaining the target query speech from the second query sentence and the target query intonation comprises: obtaining background attribute information corresponding to the target specific person; modifying the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtaining target query voice according to the modified second query sentence and the target query intonation.
In some embodiments, the determining a target query policy based on the target reply intention corresponding to the target specific person, and querying the target specific person based on the target query policy, includes: acquiring a virtual image corresponding to the digital person; determining corresponding target image adjustment parameters based on the target reply intention corresponding to the target specific person; and carrying out image adjustment on the virtual image according to the target image adjustment parameters, and controlling the image-adjusted virtual image to query the target specific person.
In some embodiments, the determining, based on the first combined feature, the target reply intention corresponding to the target specific person includes: inputting the first combined feature into a trained intention recognition model, the intention recognition model processing the first combined feature using model parameters corresponding to the first combined feature to obtain the target reply intention corresponding to the target specific person.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of: outputting a first query sentence; acquiring the specific person voice and the specific person image corresponding to the target specific person's reply to the first query sentence; acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set; combining the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature; and determining the target reply intention corresponding to the target specific person based on the first combined feature.
In some embodiments, the acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set includes: acquiring voice attribute information corresponding to the target specific person based on the specific person voice, wherein the voice attribute information includes at least one of speech speed information and intonation change information.
In some embodiments, the acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set includes: carrying out semantic analysis on the specific person voice to obtain the target semantics corresponding to the specific person voice. The computer program further causes the processor to perform the step of: combining the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature. The determining, based on the first combined feature, the target reply intention corresponding to the target specific person includes: determining the target reply intention corresponding to the target specific person based on the first combined feature and the second combined feature.
In some embodiments, the acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set includes: acquiring the face features corresponding to the target specific person based on the specific person image, and processing the face features with a trained expression recognition model to obtain the target expression corresponding to the target specific person.
In some embodiments, the computer program further causes the processor to perform the steps of: and determining a target inquiry strategy based on the target reply intention corresponding to the target specific person, and inquiring the target specific person based on the target inquiry strategy.
In some embodiments, the determining a target query policy based on the target reply intent corresponding to the target specific person, and querying the target specific person based on the target query policy includes: acquiring a second inquiry sentence for inquiring the target specific person; determining a corresponding target inquiry intonation according to the target reply intention; obtaining target inquiry voice according to the second inquiry sentence and the target inquiry intonation; and outputting the target inquiry voice.
In some embodiments, the obtaining a second query statement that queries the target specific person comprises: carrying out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquiring the query sentence corresponding to the target semantic from the query sentence library as a second query sentence.
In some embodiments, the obtaining the target query speech according to the second query sentence and the target query intonation includes: obtaining background attribute information corresponding to the target specific person; modifying the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtaining the target query voice according to the modified second query sentence and the target query intonation.
In some embodiments, the determining a target query policy based on the target reply intention corresponding to the target specific person, and querying the target specific person based on the target query policy, includes: acquiring a virtual image corresponding to the digital person; determining corresponding target image adjustment parameters based on the target reply intention corresponding to the target specific person; and carrying out image adjustment on the virtual image according to the target image adjustment parameters, and controlling the image-adjusted virtual image to query the target specific person.
In some embodiments, the determining, based on the first combined feature, the target reply intention corresponding to the target specific person includes: inputting the first combined feature into a trained intention recognition model, the intention recognition model processing the first combined feature using model parameters corresponding to the first combined feature to obtain the target reply intention corresponding to the target specific person.
According to the digital person-based specific person query method, device, computer equipment and storage medium above, after the first query sentence is output, the specific person voice and the specific person image corresponding to the target specific person's reply to the first query sentence are acquired to determine the behavior features and voice features corresponding to the target specific person; these features are combined, and the target reply intention is determined based on the combined feature, improving the accuracy with which the digital person judges the authenticity of a specific person's replies.
Drawings
FIG. 1 is a diagram of an application environment for a digital person-based person-specific interrogation method in one embodiment;
FIG. 2 is a flow diagram of a digital person-based method of querying a particular person in one embodiment;
FIG. 3 is a flow diagram of a digital person-based method of querying a particular person in one embodiment;
FIG. 4A is a flow diagram of a query step for a target specific person based on a target query policy determined based on a target reply intent corresponding to the target specific person, in one embodiment;
FIG. 4B is a schematic diagram of an interface for a digital person to interrogate in some embodiments;
FIG. 5 is a block diagram of a digital person-based specific person query apparatus in one embodiment;
FIG. 6 is a block diagram of the structure of a target interrogation policy determination module in one embodiment;
fig. 7 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The digital person-based specific person query method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 is placed in the area where the target specific person is located; for example, when the target specific person replies to the first query sentence in an interrogation room, the terminal is placed in that interrogation room. The server 104 can send a first query sentence to the terminal 102, and the terminal 102 can output it in voice or text form. A camera and a recording device can be installed on the terminal 102, so that recording and image acquisition can be performed while the target specific person replies to the first query sentence; the resulting specific person voice and specific person image are sent to the server 104. The server 104 acquires a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set; acquires a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set; combines the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature; and determines the target reply intention corresponding to the target specific person based on the first combined feature. After obtaining the target reply intention, the server 104 may send it to the terminal 102, or may determine the next query sentence based on it to query the target specific person.
The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device, and the server 104 may be implemented by a stand-alone server or a server cluster composed of a plurality of servers. It will be appreciated that the digital person-based specific person query method of the embodiments of the present application may also be executed on the terminal 102. The digital person in the embodiments of the present application is a virtual person that can assist or replace a real person in executing tasks; for example, it can be a set of developed programs that, when executed, assist or replace a real person in questioning a criminal suspect.
In one embodiment, as shown in fig. 2, a specific person query method based on a digital person is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
step S202, outputting a first query sentence.
Wherein a query sentence is a sentence used to question a specific person. A specific person is a particular person whose replies need to be checked for authenticity, for example, a criminal suspect. The first query sentence may be randomly extracted from a question library in which a plurality of candidate questions are stored. The first query sentence may also be obtained according to attribute information of the target specific person: for example, the server may obtain a video image of the target specific person, perform face detection on the video image, obtain the identity information of the target specific person by face recognition, obtain at least one of the occupation, age or native place of the target specific person from an attribute information database according to the identity information, and then question the criminal suspect with questions matched to the attribute information. For the same question, the way it is phrased may differ for specific persons of different professions. For example, for a question about the time of the crime, if the occupation of the criminal suspect is in finance, the query sentence may be "Was the day of the incident a trading day or a non-trading day?"; if the occupation is in law, the query sentence may be "On the day of the incident, were you at home or in the office all day?".
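As an illustrative sketch only (not part of the original disclosure), the attribute-matched question selection described above might look as follows; the question bank contents and the `occupation` attribute key are hypothetical:

```python
# Hypothetical question bank keyed by occupation; the same underlying question
# is phrased differently depending on the suspect's professional background.
QUESTION_BANK = {
    "finance": "Was the day of the incident a trading day or a non-trading day?",
    "legal": "On the day of the incident, were you at home or in the office all day?",
    "default": "Where were you on the day of the incident?",
}

def select_first_query(attributes: dict) -> str:
    """Pick a first query sentence matched to the person's attribute info."""
    occupation = attributes.get("occupation", "default")
    return QUESTION_BANK.get(occupation, QUESTION_BANK["default"])

print(select_first_query({"occupation": "finance"}))
```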
Specifically, the server may send the first query sentence to the terminal, and the terminal may display or play it. For example, an avatar, such as a 3D (three-dimensional) avatar, may be presented on the screen of the terminal; after the terminal receives the first query sentence, the avatar is controlled to play the first query sentence by voice.
Step S204, acquiring the specific person voice and the specific person image corresponding to the target specific person's reply to the first query sentence.
The target specific person refers to the person who needs to be questioned; the first query sentence is output for the target specific person, so the target specific person is expected to reply to it. The specific person voice and the specific person image are acquired in real time while the target specific person replies to the first query sentence. For example, after the first query sentence is played, the criminal suspect starts to answer, and the voice and image from the completion of playback to the completion of the answer can be obtained.
Specifically, the terminal may control sensing devices to obtain voice information and images; for example, it may use a recording device to record sound and a video shooting device to capture video, so as to obtain the specific person voice and the specific person image corresponding to the target specific person's reply to the first query sentence. The terminal can send the specific person voice and the specific person image to the server in real time, and the server acquires them.
Step S206, a plurality of behavior features corresponding to the target specific person are obtained based on the specific person image, and a behavior feature set is obtained.
Wherein behavior features are features used to represent behavior characteristics, for example, mental state features, expression features, gesture features, or facial features. Facial features may include at least one of a behavior feature corresponding to the eyes or a behavior feature corresponding to the nose. The behavior feature corresponding to the eyes may be, for example, at least one of open or closed; the behavior feature corresponding to the nose may be, for example, at least one of nasal inhalation or nasal exhalation.
Behavior features may be recognized by artificial intelligence models; for example, a model for recognizing behavior features may be trained in advance through supervised training. A training image and the label (behavior feature) corresponding to the training image are obtained, the training image is input into the behavior feature recognition model to be trained, and a predicted behavior feature is output. A model loss value is obtained according to the difference between the predicted behavior feature and the label, and the model parameters are adjusted in the direction that decreases the model loss value until the model converges. The convergence condition may be that the model loss value is smaller than a preset threshold. The difference between the predicted behavior feature and the label is positively correlated with the model loss value: the larger the difference, the larger the model loss value.
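A minimal sketch of the supervised training loop described above, written with PyTorch; the model architecture, sizes, and data are stand-ins, since the patent does not specify them:

```python
import torch
import torch.nn as nn

# Toy behavior-feature recognition model: encoded image vector in, class out.
model = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 4))
criterion = nn.CrossEntropyLoss()      # loss grows with the prediction/label gap
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(32, 512)          # stand-in for encoded training images
labels = torch.randint(0, 4, (32,))    # stand-in behavior-feature labels

for epoch in range(100):
    optimizer.zero_grad()
    predicted = model(images)            # predicted behavior features
    loss = criterion(predicted, labels)  # difference between prediction and label
    loss.backward()
    optimizer.step()                     # adjust parameters toward lower loss
    if loss.item() < 0.01:               # convergence: loss below preset threshold
        break
```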
In some embodiments, the server may perform image recognition on the specific person image using computer vision technology to obtain at least one of the micro-expressions or the mental state information of the criminal suspect, thereby obtaining the behavior features corresponding to the target specific person. The behavior feature set may include one or more behavior features, where "a plurality" means at least two. For example, the specific person image may be input into a behavior feature recognition model, which outputs the behavior features.
In some embodiments, the behavior features may include expressions. Acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set includes: acquiring the face features corresponding to the target specific person based on the specific person image, and processing the face features with a trained expression recognition model to obtain the target expression corresponding to the target specific person.
Specifically, an expression is an emotion or thought shown on the face; it may be, for example, panic, excitement, or anger. Face features are features related to the face, for example, features corresponding to the eyes, the mouth, or the nose. Face features may be extracted by a face feature extraction model, which may be a deep learning model; a plurality of face feature extraction models may be used, for example, at least one of a model extracting features corresponding to the eyes or a model extracting features corresponding to the mouth. The face feature extraction model and the expression recognition model can be cascaded and obtained by joint training. For example, a training image may be input into the face feature extraction model to obtain face features, and the face features input into the expression recognition model to obtain a predicted expression. A model loss value is obtained according to the difference between the predicted expression and the actual expression, which are positively correlated, and the model parameters are adjusted by gradient descent. In this way, the face feature extraction model and the expression recognition model can be obtained quickly through joint training.
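The joint (cascaded) training can be sketched in the same illustrative style; again the architectures and data are placeholders, and the single optimizer over both modules is what makes the training "joint":

```python
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(nn.Linear(512, 64), nn.ReLU())
expression_recognizer = nn.Linear(64, 3)   # e.g. panic / excitement / anger
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    list(feature_extractor.parameters()) + list(expression_recognizer.parameters()),
    lr=0.01,
)

images = torch.randn(16, 512)              # stand-in training images
expressions = torch.randint(0, 3, (16,))   # stand-in actual expressions

for _ in range(50):
    optimizer.zero_grad()
    face_features = feature_extractor(images)        # cascade stage 1
    predicted = expression_recognizer(face_features)  # cascade stage 2
    loss = criterion(predicted, expressions)
    loss.backward()                        # gradients flow through both models
    optimizer.step()
```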
Step S208, a plurality of voice features corresponding to the target specific person are obtained based on the specific person voice, and a voice feature set is obtained.
Where a voice feature is a feature used to represent a characteristic of speech; for example, voice features may include at least one of intonation or speech rate. Intonation refers to the rise and fall of sound within a sentence; it may, for example, rise, fall, or rise or fall suddenly. The voice feature set may include one or more voice features. Intonation features can be obtained by measuring changes in voice frequency.
Specifically, the server may perform voice feature recognition on the specific person voice using natural language processing technology to obtain the voice feature set. For example, the server acquires voice attribute information corresponding to the target specific person based on the specific person voice, the voice attribute information including at least one of speech speed information or intonation change information. The intonation change information may be computed in units of a preset time length: for example, the average voice frequency in each period of the preset time length may be calculated, and the intonation determined according to the change of the average voice frequency between adjacent periods. For example, assuming the preset time length is 1 second, the average voice frequency of the 1st second and the average voice frequency of the 2nd second may be obtained; when the average frequency of the 2nd second is higher than that of the 1st second, the intonation change information indicates a rising intonation.
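A sketch of the per-window intonation comparison; the zero-crossing frequency estimate is a crude stand-in for a real pitch tracker, since the patent only specifies averaging voice frequency per preset time length and comparing adjacent periods:

```python
import numpy as np

def intonation_changes(samples: np.ndarray, sr: int, window_s: float = 1.0):
    """Classify intonation movement between adjacent fixed-length windows."""
    win = int(sr * window_s)
    changes, prev_freq = [], None
    for start in range(0, len(samples) - win + 1, win):
        chunk = samples[start:start + win]
        # Zero-crossing count as a rough average-frequency estimate (Hz).
        crossings = np.count_nonzero(np.signbit(chunk[1:]) != np.signbit(chunk[:-1]))
        freq = crossings / (2.0 * window_s)
        if prev_freq is not None:
            changes.append("rising" if freq > prev_freq else "falling")
        prev_freq = freq
    return changes

# Two seconds of audio whose frequency rises from 120 Hz to 180 Hz.
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([np.sin(2 * np.pi * 120 * t), np.sin(2 * np.pi * 180 * t)])
print(intonation_changes(audio, sr))  # ['rising']
```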
Step S210, combining the features in the behavior feature set and the features in the voice feature set to obtain a first combined feature.
Specifically, when the features in the behavior feature set and the voice feature set are combined, all of them may be combined together, or only some of them; for example, there may be a plurality of first combined features. Each first combined feature includes at least one feature from the behavior feature set and at least one feature from the voice feature set, and the combined feature is processed as a whole.
Specifically, the server may obtain at least one behavioral feature from the behavioral feature set, obtain at least one speech feature from the speech feature set, and combine the obtained features to obtain the first combined feature.
In some embodiments, the specific person voice and the specific person image each span a period of time; when combining, the behavior features and voice features within the same time range can be combined, so as to represent the psychological state of the target specific person, such as a criminal suspect, during that period.
In some embodiments, when combining, the behavior features in a first time period may be combined with the voice features in a second time period, where the first and second time periods are different and may be adjacent. Combining the voice features and behavior features of adjacent time periods can reflect the psychological activity of a specific person, such as a criminal suspect. For example, a person may speak faster when lying, and often performs some specific action, such as touching the nose, before or after lying; thus combining the voice and behavior features of adjacent time periods can further reflect the criminal suspect's mental activity and help determine whether he is lying.
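The same-period and adjacent-period pairing might be sketched as follows; the time-tagged feature lists and feature names are hypothetical:

```python
from itertools import product

# (time period in seconds, feature) pairs; names are illustrative only.
behavior_features = [(0, "calm face"), (1, "touching nose")]
voice_features = [(0, "fast speech"), (1, "rising intonation")]

def combine(max_gap: int = 1):
    """Pair behavior and voice features from the same or adjacent periods."""
    combined = []
    for (bt, bf), (vt, vf) in product(behavior_features, voice_features):
        if abs(bt - vt) <= max_gap:   # gap 0: same period; gap 1: adjacent
            combined.append({"periods": (bt, vt), "features": (bf, vf)})
    return combined

for first_combined_feature in combine():
    print(first_combined_feature)
```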
Step S212, determining the target reply intention corresponding to the target specific person based on the first combined characteristic.
The target reply intention refers to the intention behind the reply and reflects how truthful the reply is; the target reply intention may be lying or truthful. The target reply intention corresponding to the first combined feature may be obtained according to a predetermined judgment rule. For example, it may be set that when the speech rate is higher than a preset speech rate and there is a nose-touching behavior in the adjacent period afterwards, the target reply intention is determined to be lying.
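The example judgment rule can be written directly; the threshold value is invented for illustration:

```python
PRESET_SPEECH_RATE = 4.0  # words per second; illustrative threshold only

def rule_based_intent(speech_rate: float, next_period_behaviors: set) -> str:
    """Example rule: fast speech followed by nose-touching => lying."""
    if speech_rate > PRESET_SPEECH_RATE and "touching nose" in next_period_behaviors:
        return "lying"
    return "truthful"

print(rule_based_intent(5.2, {"touching nose"}))  # lying
print(rule_based_intent(3.1, {"calm face"}))      # truthful
```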
According to the digital person-based specific person query method above, after the first query sentence is output, the specific person voice and the specific person image corresponding to the target specific person's reply to the first query sentence are acquired to determine the behavior features and voice features corresponding to the target specific person; these features are combined, and the target reply intention is determined based on the combined feature, which improves the accuracy of judging the authenticity of the reply.
In some embodiments, the server may input the first combined feature into a trained intent recognition model, where the intent recognition model processes the first combined feature using model parameters corresponding to the first combined feature to obtain a target reply intent corresponding to the target specific person. The intent recognition model may be, for example, a neural network model. The intention recognition model can be obtained through supervised training, and model parameters corresponding to the first combination features are obtained through model training.
In some embodiments, when there are a plurality of first combined features, the target reply intention corresponding to each first combined feature may be determined based on its judgment rule, and the final target reply intention determined by integrating them. For example, the target reply intentions corresponding to the first combined features may be counted, and the most frequent intention taken as the final target reply intention. For example, assuming there are 5 first combined features in total, where the target reply intention corresponding to 4 of them is lying and to 1 of them is telling the truth, the final target reply intention is lying. Determining the final target reply intention from the target reply intention of each first combined feature amounts to multi-level analysis, which improves the accuracy of intention analysis.
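A sketch of the majority vote over per-feature intents:

```python
from collections import Counter

def final_intent(per_feature_intents: list) -> str:
    """Take the most frequent intent across all first combined features."""
    return Counter(per_feature_intents).most_common(1)[0][0]

# 4 of 5 combined features indicate lying, 1 indicates truth.
print(final_intent(["lying", "lying", "lying", "lying", "truthful"]))  # lying
```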
In some embodiments, when determining the target reply intention, the reply intention corresponding to a preceding query sentence may be acquired as the preceding reply intention, and the target reply intention corresponding to the target specific person determined according to it. A preceding query sentence is a sentence used to question the target specific person before the first query sentence. For example, the number of reply intentions that are lying (the preceding lie count) may be obtained from the reply intentions corresponding to the combined features of the preceding query sentence; when the number of lying target reply intentions corresponding to the first combined features is greater than the preceding lie count, it is determined that the target specific person is lying. For example, if in the last round of intention recognition the reply intention corresponding to 3 combined features was lying, and this time the reply intention corresponding to 4 combined features is lying, the target specific person is likely to be lying.
In some embodiments, during the query, the digital person may acquire the specific person image and specific person voice over a preset time period, for example 10 minutes, and generate the intention result corresponding to the target specific person from the acquired voice and behavior features; the terminal may also display the intention result to the interrogating police officers, for example on a terminal corresponding to the police, so that they can adjust the query approach.
In some embodiments, the intention recognition result obtained in each time period (called an intermediate intention recognition result) may also be stored, and when the query ends, a final intention analysis result is obtained based on the intermediate intention recognition results, that is, a final summary of the criminal suspect's intentions. For example, the change pattern of the intermediate intention recognition results may be obtained and output; the pattern may be, for example, lying almost throughout the first 20 minutes and gradually telling the truth afterwards. Multi-level analysis using the intermediate intention recognition results makes the final result more accurate, and also helps the police infer the criminal suspect's psychological state and adjust the query approach mid-way, so as to better carry out the query.
In some embodiments, a change pattern of the reply intention may be obtained. For example, the server may also output the probabilities corresponding to the reply intentions, determine the probability change pattern, and output the psychological state change of the specific person according to it. For example, if the probability of telling the truth gradually increases, the psychological state change result is that the person is gradually inclined to cooperate; if the probability of lying gradually increases, the result is that the person is increasingly inclined to resist the interrogation.
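A possible reading of the probability-trend mapping, with the state descriptions taken from the text and the monotonicity test as an assumed interpretation:

```python
def mental_state_trend(truth_probabilities: list) -> str:
    """Map the trend of per-period truth probabilities to a state description."""
    if truth_probabilities == sorted(truth_probabilities):
        return "gradually inclined to cooperate"
    if truth_probabilities == sorted(truth_probabilities, reverse=True):
        return "increasingly inclined to resist interrogation"
    return "fluctuating"

print(mental_state_trend([0.2, 0.4, 0.7, 0.9]))  # gradually inclined to cooperate
```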
In some embodiments, as shown in fig. 3, the digital person-based person-specific interrogation method may further comprise the steps of:
Step S302, semantic analysis is carried out on the voice of the specific person, and target semantics corresponding to the voice of the specific person are obtained.
The semantics refer to the meaning expressed in the reply sentence; for example, for a criminal suspect's reply, the target semantics may be whether the crime facts are acknowledged. The target semantics can be obtained by recognition with a semantic recognition model, which is an artificial intelligence model.
Step S304, combining the target semantics and intonation change information in the voice attribute information to obtain a second combined feature.
Step S212, namely, determining, based on the first combined feature, a target reply intention corresponding to the target specific person includes: and determining the target reply intention corresponding to the target specific person based on the first combined feature and the second combined feature.
Specifically, different semantics correspond to different intonation changes; combining the semantics with the intonation change information to obtain the second combined feature makes it possible to mine the psychological state revealed by the language when the target specific person expresses the meaning he wants to express. Therefore, determining the target reply intention of the target specific person also based on the second combined feature, for example by inputting the first combined feature and the second combined feature into the intention recognition model, can improve the accuracy of the obtained target reply intention.
In some embodiments, the digital person-based person-specific interrogation method may further comprise the steps of: and determining a target inquiry strategy based on the target reply intention corresponding to the target specific person, and inquiring the target specific person based on the target inquiry strategy.
The query policy refers to the policy adopted for questioning; it may include at least one of the intonation of the question, the manner of questioning, the image of the questioner, or the type of question asked. The correspondence between reply intentions and query policies may be preset, so that the corresponding target query policy can be obtained according to the target reply intention and the target specific person questioned according to it. By adopting different query policies for different reply intentions, the query can be made more efficient.
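The preset intent-to-policy correspondence could be as simple as a lookup table; the policy fields below are assembled from the examples in this section and are not an exhaustive list from the patent:

```python
QUERY_POLICIES = {
    "lying": {
        "intonation": "rising",
        "question_type": "semantically similar follow-up",
        "avatar": "stern",
    },
    "truthful": {
        "intonation": "level",
        "question_type": "next question in preset order",
        "avatar": "mild",
    },
}

def target_query_policy(reply_intent: str) -> dict:
    """Look up the preset query policy for a recognized reply intention."""
    return QUERY_POLICIES[reply_intent]

print(target_query_policy("lying"))
```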
For example, if the reply intention is lying, the intonation and severity of the query may be increased so that the criminal suspect feels pressure, and query sentences semantically similar to the first query sentence may be obtained to ask more targeted questions and improve the query effect.
For another example, if the reply intention is telling the truth, query sentences semantically similar to the first query sentence may be skipped, and other questions retrieved for the query.
For another example, if the reply intention is lying, the manner of questioning may be changed, with a real person, such as a police officer, asking the questions instead, to make the query more effective.
In some embodiments, determining a target query policy based on the target reply intention corresponding to the target specific person and querying the target specific person based on the target query policy includes: acquiring the virtual image corresponding to the digital person; determining corresponding target image adjustment parameters based on the target reply intention corresponding to the target specific person; and adjusting the avatar according to the target image adjustment parameters, and controlling the adjusted avatar to query the target specific person.
In particular, an avatar is a virtual character image rather than a real person's image, for example, a cartoon character. An image adjustment parameter is a parameter for adjusting the avatar and includes, for example, at least one of an adjustment parameter corresponding to the face or an adjustment parameter corresponding to gestures. Different reply intentions may correspond to different avatar adjustment parameters, and the correspondence between reply intentions and adjustment parameters is preset. For example, when the target reply intention is telling the truth, the adjustment parameter is a mild one, used to adjust the avatar to a mild state; when the target reply intention is lying, the adjustment parameter is a strict one, used to adjust the avatar to a stern state. As a practical example, the mild adjustment parameters may include parameters adjusting the face to a smile; the strict adjustment parameters may include parameters adjusting the face to a stern expression and parameters adjusting the gesture to slapping the table. Once adjusted, the avatar can be controlled to query the target specific person.
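A sketch of the avatar adjustment lookup; representing the avatar as a plain dictionary is an assumption for illustration:

```python
# Preset correspondence between reply intention and adjustment parameters.
AVATAR_ADJUSTMENTS = {
    "truthful": {"face": "smile", "gesture": "relaxed"},
    "lying": {"face": "stern", "gesture": "slap table"},
}

def adjust_avatar(avatar: dict, reply_intent: str) -> dict:
    """Return the avatar updated with the intent's adjustment parameters."""
    return {**avatar, **AVATAR_ADJUSTMENTS[reply_intent]}

avatar = {"model": "3d_cartoon_officer", "face": "neutral", "gesture": "idle"}
print(adjust_avatar(avatar, "lying"))
```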
In some embodiments, as shown in fig. 4A, determining a target query policy based on the target reply intention corresponding to the target specific person and querying the target specific person based on the target query policy includes the following steps:
Step S402, a second query sentence for querying the target specific person is acquired.
Specifically, the second query sentence may be selected randomly, or may be determined according to the target reply intention. For example, if the target reply intention is lying, a sentence semantically similar to the first query sentence is obtained as the second query sentence, so that the facts can be further verified through semantically similar questions. If the target reply intention is telling the truth, the second query sentence may be acquired according to a preset query sentence sequence to question the criminal suspect.
In some embodiments, obtaining a second query sentence for querying the target specific person includes: carrying out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquiring the query sentence corresponding to the target semantics from the query sentence library as the second query sentence.
Specifically, the server can use a semantic recognition model to recognize the meaning expressed by the specific person voice, thereby obtaining the target semantics, for example whether the crime facts are acknowledged or denied. The query sentence library may store query sentences corresponding to particular semantics, that is, the correspondence between semantics and query sentences, so that the corresponding query sentence can be obtained as the second query sentence according to the target semantics. By acquiring the corresponding target query sentence from the query sentence library according to the meaning expressed when the target specific person, such as a criminal suspect, replies, the digital person can question flexibly and the query effect is improved.
For example, assuming the target semantics is that the criminal suspect states that he was having a meal at a restaurant at the time, query sentences relating to the restaurant meal may be obtained, such as query sentences asking for restaurant-related information or for details of the meal, for example a query asking for the name of the restaurant or for the dishes ordered there.
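A minimal sketch of such a query sentence library follows; it is not from the disclosure, and the semantics labels and sentences are invented to match the restaurant example.

```python
# Hypothetical query sentence library keyed by recognized target semantics.
QUERY_SENTENCE_LIBRARY = {
    "claims_restaurant_meal": [
        "What is the name of the restaurant?",
        "Which dishes did you order there?",
    ],
    "denies_crime_facts": [
        "Where were you at the time of the incident?",
    ],
}

def second_query_sentence(target_semantics: str) -> str:
    """Return a stored query sentence matching the target semantics."""
    candidates = QUERY_SENTENCE_LIBRARY.get(target_semantics)
    if not candidates:
        return "Please describe what you did that day in detail."
    return candidates[0]

print(second_query_sentence("claims_restaurant_meal"))
```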
In some embodiments, a target entity in the target semantics may be acquired, and a query sentence corresponding to the target entity may be acquired as the second query sentence. For example, associated entities related to the target entity can be obtained from a knowledge graph, and the corresponding second query sentence obtained according to the associated entities. For instance, assume the target semantics is "I went to xx park that day", and assume that in the knowledge graph the entities associated with "xx park" include "entrance A" and "entrance B", whose attribute is "entrance"; a question related to "entrance" can then be acquired, such as "Did you enter the park through entrance A or entrance B?"
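The knowledge graph lookup can be sketched in the same spirit; the adjacency-list structure below is an assumption, with only the park and entrance example taken from the text.

```python
from typing import Optional

# Toy knowledge graph: each target entity maps to its associated entities.
KNOWLEDGE_GRAPH = {
    "xx park": [
        {"entity": "entrance A", "attribute": "entrance"},
        {"entity": "entrance B", "attribute": "entrance"},
    ],
}

def entrance_question(target_entity: str) -> Optional[str]:
    """Build a follow-up question from entities associated with the target entity."""
    related = KNOWLEDGE_GRAPH.get(target_entity, [])
    entrances = [r["entity"] for r in related if r["attribute"] == "entrance"]
    if len(entrances) >= 2:
        return f"Did you enter the park through {entrances[0]} or {entrances[1]}?"
    return None

print(entrance_question("xx park"))
```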
Step S404, corresponding target inquiry intonation is determined according to the target reply intention.
Specifically, the correspondence between reply intentions and query intonations is preset, so that after the server obtains the target reply intention, the corresponding target query intonation can be obtained according to this correspondence. For example, it may be set that when the reply intention is lying, the acquired query intonation is a rising tone, and when the reply intention is telling the truth, the intonation corresponding to an ordinary mood, for example a level tone, is obtained.
Step S406, obtaining the target query speech according to the second query sentence and the target query intonation.
Specifically, when obtaining the target query speech, the second query sentence may be read aloud in the target query intonation using speech synthesis technology, so as to obtain the target query speech.
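The step can be sketched as follows; `synthesize` is a placeholder rather than a real speech synthesis API, and the prosody presets are invented, so a deployed system would substitute its own TTS engine here.

```python
# Hypothetical prosody presets keyed by the target query intonation.
INTONATION_PRESETS = {
    "rising": {"pitch_shift": 2.0, "rate": 1.1},   # sterner, more pressing delivery
    "level":  {"pitch_shift": 0.0, "rate": 1.0},   # ordinary delivery
}

def synthesize(text: str, prosody: dict) -> bytes:
    # Placeholder so the sketch runs end to end; a real TTS engine goes here.
    return f"{prosody}|{text}".encode("utf-8")

def target_query_speech(second_query_sentence: str, target_intonation: str) -> bytes:
    """Read the second query sentence aloud in the selected query intonation."""
    prosody = INTONATION_PRESETS.get(target_intonation, INTONATION_PRESETS["level"])
    return synthesize(second_query_sentence, prosody)

audio = target_query_speech("Which entrance did you use?", "rising")
```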
In some embodiments, obtaining the target query speech according to the second query sentence and the target query intonation includes: obtaining background attribute information corresponding to the target specific person; modifying the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtaining the target query speech according to the modified second query sentence and the target query intonation.
The background attribute information is attribute information indicating the background of the user, such as age, occupation, or hobbies. After the background attribute information is obtained, the second query sentence can be modified based on it, so that the modified second query sentence better matches the background of the target specific person, thereby improving the effectiveness of the query.
In some embodiments, nouns in the second query sentence may be modified based on the background attribute information; for example, for criminal suspects of different occupations, the nouns in the second query sentence may be adapted to the occupation, as sketched below. As another example, a sentence generation model, that is, a model which generates one sentence from another, may be acquired; the second query sentence and the background attribute information are input into the sentence generation model, which rewrites the second query sentence according to the background attribute information to generate a semantically similar query sentence as the modified second query sentence. By modifying the second query sentence according to the background attribute information, a modified second query sentence matching the background attribute information can be obtained, so that the target specific person can better understand the query sentence.
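As a rough stand-in for the sentence generation model, and purely as an assumption, the occupation-dependent rewriting can be illustrated with a noun substitution table; a production system would use a trained generation model instead, and all names here are invented.

```python
# Hypothetical per-occupation noun table standing in for the generation model.
OCCUPATION_NOUNS = {
    "chef":   {"workplace": "kitchen"},
    "driver": {"workplace": "depot"},
}

def modify_second_query(second_query: str, background: dict) -> str:
    """Rewrite occupation-dependent slots so the question matches the target
    specific person's background attribute information."""
    nouns = OCCUPATION_NOUNS.get(background.get("occupation", ""),
                                 {"workplace": "workplace"})
    for slot, noun in nouns.items():
        second_query = second_query.replace("{" + slot + "}", noun)
    return second_query

print(modify_second_query("What time did you leave the {workplace}?",
                          {"occupation": "chef"}))
```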
Step S408, outputting the target query speech.
Specifically, the server may transmit the target query speech to the terminal, and the terminal may play the target query speech through a voice playing device.
The method provided by the embodiments of the application can be applied to the scenario of questioning criminal suspects. Conventional questioning of criminal suspects is usually conducted face to face, with police officers drawing on the knowledge and experience gained through professional training to obtain true and complete information. However, high-quality police resources are scarce, and experienced criminal suspects may deceive, conceal facts, or exploit the weaknesses of a human interrogator to evade the law. A virtual digital police officer can therefore be created to question criminal suspects. The virtual digital police officer can collect video and sound information of the suspect through video and voice collection devices for subsequent reply intention recognition, can present query sentences through video and voice playing equipment so as to interact with the suspect, and can display pictures and video evidence through the video equipment.
In addition, massive data such as a complete knowledge graph and a criminal database can serve as the basis on which the virtual digital police officer analyzes reply intentions when questioning criminal suspects. For example, the server may analyze the truthfulness of a suspect's statements in combination with the complete psychology and criminal psychology databases available to the digital person, and may adjust the questioning strategy, such as the intonation, based on the obtained target reply intention. Because the digital person is neutral, it will not be misled by a suspect's exaggerated performance, so true and effective information can be obtained. The digital person is also highly replicable and can be rolled out to police departments everywhere, giving each department a digital person with the same capabilities.
The digital person of the embodiments of the application can be represented by an avatar, and the avatar can be adjusted according to the target reply intention corresponding to the target specific person; for example, when the reply intention is lying, the avatar is adjusted to a strict state. Fig. 4B is a schematic diagram of an interface through which the digital person questions a suspect. In some embodiments, an avatar may be displayed on the query interface: when no lie by the target specific person is detected, the avatar is in a normal state, and when a lie is detected, the avatar is adjusted to a strict state. The terminal may control the avatar to utter a spoken query sentence, for example, "Which entrance did you enter the park through that day?"
It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of steps or stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of execution of the steps or stages is not necessarily sequential, but may be performed in turn or alternately with at least a part of other steps or stages.
In one embodiment, as shown in FIG. 5, there is provided a digital person-based person-specific interrogation apparatus comprising: a first query sentence output module 502, an information acquisition module 504, a behavior feature set obtaining module 506, a voice feature set obtaining module 508, a first combination module 510, and a target reply intention determining module 512, wherein:
the first query sentence output module 502 is configured to output a first query sentence.
The information obtaining module 504 is configured to obtain a specific person voice and a specific person image corresponding to the target specific person replying to the first query sentence.
The behavior feature set obtaining module 506 is configured to obtain a plurality of behavior features corresponding to the target specific person based on the specific person image, so as to obtain a behavior feature set.
The voice feature set obtaining module 508 is configured to obtain a plurality of voice features corresponding to the target specific person based on the specific person voice, and obtain a voice feature set.
The first combination module 510 is configured to combine the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature.
The target reply intention determining module 512 is configured to determine a target reply intention corresponding to the target specific person based on the first combined feature.
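As a sketch only, the data flow through these modules might be wired as below; every callable body is an assumption, and only the module roles come from the description above.

```python
# Minimal wiring of the modules of FIG. 5, each module reduced to a callable.
class PersonSpecificInterrogationDevice:
    def __init__(self, output_fn, acquire_fn, behavior_fn, voice_fn, intent_fn):
        self.output_fn = output_fn      # first query sentence output module 502
        self.acquire_fn = acquire_fn    # information acquisition module 504
        self.behavior_fn = behavior_fn  # behavior feature set obtaining module 506
        self.voice_fn = voice_fn        # voice feature set obtaining module 508
        self.intent_fn = intent_fn      # target reply intention determining module 512

    def run_round(self, first_query: str) -> str:
        self.output_fn(first_query)
        voice, image = self.acquire_fn()
        combined = self.behavior_fn(image) + self.voice_fn(voice)  # first combination module 510
        return self.intent_fn(combined)

# Example wiring with stub callables.
device = PersonSpecificInterrogationDevice(
    output_fn=print,
    acquire_fn=lambda: ("<voice>", "<image>"),
    behavior_fn=lambda image: ["frown"],
    voice_fn=lambda voice: ["fast_speech"],
    intent_fn=lambda features: "lying" if "frown" in features else "telling_the_truth",
)
print(device.run_round("Where were you that night?"))
```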
In some embodiments, the voice feature set obtaining module is configured to: acquire voice attribute information corresponding to the target specific person based on the specific person voice, the voice attribute information including at least one of speech speed information or intonation change information.

In some embodiments, the voice feature set obtaining module is configured to: carry out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice. The digital person-based person-specific interrogation apparatus further comprises a second combination module, configured to combine the target semantics with the intonation change information in the voice attribute information to obtain a second combined feature. The target reply intention determining module is configured to: determine the target reply intention corresponding to the target specific person based on the first combined feature and the second combined feature.

In some embodiments, the behavior feature set obtaining module is configured to: acquire the facial features corresponding to the target specific person based on the specific person image, and process the facial features using a trained expression recognition model to obtain the target expression corresponding to the target specific person.

In some embodiments, the digital person-based person-specific interrogation apparatus further includes a target query policy determining module, configured to determine a target query policy based on the target reply intention corresponding to the target specific person and to query the target specific person based on the target query policy.
In some embodiments, as shown in fig. 6, the target query policy determining module includes:
a second query sentence acquisition unit 602, configured to acquire a second query sentence for querying the target specific person.
The target query intonation determining unit 604 is configured to determine a corresponding target query intonation according to the target reply intention.
The target query speech obtaining unit 606 is configured to obtain a target query speech according to the second query sentence and the target query intonation.
The target query speech output unit 608 is configured to output the target query speech.
In some embodiments, the second query sentence acquisition unit is configured to: carry out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; and acquire the query sentence corresponding to the target semantics from the query sentence library as the second query sentence.

In some embodiments, the target query speech obtaining unit is configured to: obtain background attribute information corresponding to the target specific person; modify the second query sentence according to the background attribute information to obtain a modified second query sentence; and obtain the target query speech according to the modified second query sentence and the target query intonation.
In some embodiments, the target query policy determining module is configured to: acquire a virtual avatar corresponding to the digital person; determine corresponding target avatar adjustment parameters based on the target reply intention corresponding to the target specific person; and adjust the avatar according to the target avatar adjustment parameters, and control the adjusted avatar to query the target specific person.
In some embodiments, the target reply intention determining module is configured to: input the first combined feature into a trained intention recognition model, the intention recognition model processing the first combined feature using the model parameters corresponding to the first combined feature to obtain the target reply intention corresponding to the target specific person.
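For illustration only, a toy classifier can stand in for the trained intention recognition model; the six-dimensional feature layout, the random training data, and the two-label scheme below are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the trained intention recognition model, trained on random data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(64, 6))      # first combined features (assumed layout)
y_train = rng.integers(0, 2, size=64)   # 0 = telling the truth, 1 = lying

model = LogisticRegression().fit(X_train, y_train)

def target_reply_intention(first_combined_feature: np.ndarray) -> str:
    """Map a first combined feature vector to a reply intention label."""
    label = model.predict(first_combined_feature.reshape(1, -1))[0]
    return "lying" if label == 1 else "telling_the_truth"

print(target_reply_intention(rng.normal(size=6)))
```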
For specific limitations on the digital person-based person-specific interrogation apparatus, reference may be made to the limitations on the digital person-based person-specific interrogation method above, which are not repeated here. The individual modules in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded, in hardware form, in or independent of a processor in the computer device, or may be stored, in software form, in a memory in the computer device, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data for the digital person-based person-specific interrogation method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the digital person-based person-specific interrogation method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of: outputting a first query sentence; acquiring specific person voice corresponding to a target specific person replying to the first inquiry sentence and a specific person image; acquiring a plurality of behavior features corresponding to a target specific person based on the specific person image to obtain a behavior feature set; acquiring a plurality of voice features corresponding to a target specific person based on the specific person voice to obtain a voice feature set; combining the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature; and determining the target reply intention corresponding to the target specific person based on the first combined characteristic.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of: outputting a first query sentence; acquiring specific person voice corresponding to a target specific person replying to the first inquiry sentence and a specific person image; acquiring a plurality of behavior features corresponding to a target specific person based on the specific person image to obtain a behavior feature set; acquiring a plurality of voice features corresponding to a target specific person based on the specific person voice to obtain a voice feature set; combining the features in the behavior feature set with the features in the voice feature set to obtain a first combined feature; and determining the target reply intention corresponding to the target specific person based on the first combined characteristic.
Those skilled in the art will appreciate that all or part of the flows of the above method embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-transitory computer-readable storage medium and, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (12)

1. A method of person-specific interrogation based on digital persons, the method comprising:
Outputting a first query sentence;
Acquiring a specific person voice and a specific person image corresponding to the target specific person replying to the first query sentence;
Acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set;
Acquiring a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set; the voice feature set comprises voice attribute information, wherein the voice attribute information comprises at least one of speech speed information or intonation change information;
Carrying out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; combining the target semantics with intonation change information in the voice attribute information to obtain a second combined feature;
Combining the features in the behavior feature set with the features in the voice feature set within adjacent time periods to obtain a first combined feature;
And determining the target reply intention corresponding to the target specific person based on the first combined feature and the second combined feature.
2. The method of claim 1, wherein the acquiring a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set comprises:
acquiring the facial features corresponding to the target specific person based on the specific person image, and processing the facial features by using a trained expression recognition model to obtain the target expression corresponding to the target specific person.
3. The method according to claim 1, wherein the method further comprises:
Determining a target query policy based on the target reply intention corresponding to the target specific person, and querying the target specific person based on the target query policy.
4. The method according to claim 3, wherein the determining a target query policy based on the target reply intention corresponding to the target specific person and querying the target specific person based on the target query policy comprises:
Acquiring a second query sentence for querying the target specific person;

Determining a corresponding target query intonation according to the target reply intention;

Obtaining a target query speech according to the second query sentence and the target query intonation;

And outputting the target query speech.
5. The method of claim 4, wherein the acquiring a second query sentence for querying the target specific person comprises:
Carrying out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice;
And acquiring the query sentence corresponding to the target semantics from the query sentence library as the second query sentence.
6. The method of claim 4, wherein the obtaining a target query speech according to the second query sentence and the target query intonation comprises:
obtaining background attribute information corresponding to the target specific person;
Modifying the second query sentence according to the background attribute information to obtain a modified second query sentence;
And obtaining the target query speech according to the modified second query sentence and the target query intonation.
7. The method according to claim 3, wherein the determining a target query policy based on the target reply intention corresponding to the target specific person and querying the target specific person based on the target query policy comprises:
Acquiring a virtual avatar corresponding to the digital person;

Determining corresponding target avatar adjustment parameters based on the target reply intention corresponding to the target specific person;

And adjusting the avatar according to the target avatar adjustment parameters, and controlling the adjusted avatar to query the target specific person.
8. The method of claim 1, wherein the determining a target reply intention corresponding to the target specific person based on the first combined feature comprises:
Inputting the first combined feature into a trained intention recognition model, the intention recognition model processing the first combined feature by using model parameters corresponding to the first combined feature to obtain the target reply intention corresponding to the target specific person.
9. A digital person-based person-specific interrogation device, the device comprising:
The first query sentence output module is used for outputting a first query sentence;

The information acquisition module is used for acquiring a specific person voice and a specific person image corresponding to the target specific person replying to the first query sentence;
The behavior feature set obtaining module is used for obtaining a plurality of behavior features corresponding to the target specific person based on the specific person image to obtain a behavior feature set;
the voice feature set obtaining module is used for obtaining a plurality of voice features corresponding to the target specific person based on the specific person voice to obtain a voice feature set; the voice feature set comprises voice attribute information, wherein the voice attribute information comprises at least one of speech speed information or intonation change information; carrying out semantic analysis on the specific person voice to obtain target semantics corresponding to the specific person voice; combining the target semantics with intonation change information in the voice attribute information to obtain a second combined feature;
The first combination module is used for combining the features in the behavior feature set with the features in the voice feature set within adjacent time periods to obtain a first combined feature;
and the target reply intention determining module is used for determining the target reply intention corresponding to the target specific person based on the first combined characteristic and the second combined characteristic.
10. The apparatus of claim 9, wherein the apparatus further comprises:
And the target query policy determining module is used for determining a target query policy based on the target reply intention corresponding to the target specific person and querying the target specific person based on the target query policy.
11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202010847705.2A 2020-08-21 2020-08-21 Method, device and storage medium for querying specific person based on digital person Active CN112151027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010847705.2A CN112151027B (en) 2020-08-21 2020-08-21 Method, device and storage medium for querying specific person based on digital person

Publications (2)

Publication Number Publication Date
CN112151027A CN112151027A (en) 2020-12-29
CN112151027B true CN112151027B (en) 2024-05-03

Family

ID=73888942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010847705.2A Active CN112151027B (en) 2020-08-21 2020-08-21 Method, device and storage medium for querying specific person based on digital person

Country Status (1)

Country Link
CN (1) CN112151027B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409822B (en) * 2021-05-31 2023-06-20 青岛海尔科技有限公司 Object state determining method and device, storage medium and electronic device
CN113724705B (en) * 2021-08-31 2023-07-25 平安普惠企业管理有限公司 Voice response method, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705357A (en) * 2017-09-11 2018-02-16 广东欧珀移动通信有限公司 Lie detecting method and device
CN109829358A (en) * 2018-12-14 2019-05-31 深圳壹账通智能科技有限公司 Micro- expression loan control method, device, computer equipment and storage medium
CN110427472A (en) * 2019-08-02 2019-11-08 深圳追一科技有限公司 The matched method, apparatus of intelligent customer service, terminal device and storage medium
CN110689889A (en) * 2019-10-11 2020-01-14 深圳追一科技有限公司 Man-machine interaction method and device, electronic equipment and storage medium
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN110969106A (en) * 2019-11-25 2020-04-07 东南大学 Multi-mode lie detection method based on expression, voice and eye movement characteristics
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN111429267A (en) * 2020-03-26 2020-07-17 深圳壹账通智能科技有限公司 Face examination risk control method and device, computer equipment and storage medium
CN111523981A (en) * 2020-04-29 2020-08-11 深圳追一科技有限公司 Virtual trial method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107256267B (en) * 2017-06-19 2020-07-24 北京百度网讯科技有限公司 Query method and device

Also Published As

Publication number Publication date
CN112151027A (en) 2020-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant