CN112307948A - Feature fusion method, device and storage medium

Feature fusion method, device and storage medium

Info

Publication number
CN112307948A
Authority
CN
China
Prior art keywords
target
preset
fusion
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011181418.9A
Other languages
Chinese (zh)
Inventor
李剑
苟巍
沈海峰
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202011181418.9A priority Critical patent/CN112307948A/en
Publication of CN112307948A publication Critical patent/CN112307948A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/26Government or public services
    • G06Q50/265Personal security, identity or safety
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/50Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G06V10/507Summing image-intensity values; Histogram projection analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Tourism & Hospitality (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Computer Security & Cryptography (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The method comprises obtaining target image features and target voice features of a target to be detected, fusing these features with a fusion model to obtain a fused feature of the target, and then using the fused feature to detect the target. Because both the image features and the voice features of the target are considered during detection, the method overcomes the high error rate of existing target detection that relies only on images captured by a terminal device. Moreover, fusing the image features with the voice features makes the fused feature richer in information, which further improves the accuracy of the subsequent detection result.

Description

Feature fusion method, device and storage medium
Technical Field
The present application relates to target detection technologies, and in particular, to a feature fusion method and apparatus, and a storage medium.
Background
With continued economic and technological development, people's modes of travel have become increasingly diverse; for example, people travel by commercial vehicles such as taxis and ride-hailing cars. However, while these travel modes bring convenience, they also create new problems. Taking ride-hailing as an example, a conflict may occur between the driver and a passenger while the vehicle is moving, which may cause the driver to drive the vehicle in an abnormal state, so that passengers' daily ride-hailing trips carry certain potential safety hazards.
To address this problem, the related art typically has a terminal device capture images inside the vehicle and performs target detection based on the captured images. For example, the terminal device is the driver's mobile phone, which captures in-vehicle images while the vehicle is moving; the driver and passengers in the vehicle are then detected based on the captured images, for example, to determine whether a conflict has occurred between the driver and a passenger.
However, the images captured by the terminal device carry large errors, which leads to a high error rate in the subsequent target detection based on those images; potential safety hazards therefore cannot be discovered in time, and problems that arise while the vehicle is being driven cannot be correctly intervened in.
Disclosure of Invention
In order to solve the problems in the prior art, the present application provides a feature fusion method, device and storage medium.
In a first aspect, an embodiment of the present application provides a feature fusion method, where the method includes:
acquiring target characteristics of a target to be detected, wherein the target characteristics comprise target image characteristics and target voice characteristics;
inputting the target features into a preset fusion model, wherein the preset fusion model is obtained through reference feature and reference fusion feature training, and the reference features comprise reference image features and reference voice features;
and obtaining the target fusion characteristics of the target to be detected according to the output of the preset fusion model.
In a possible implementation manner, before the inputting the target feature into the preset fusion model, the method further includes:
determining dimensions of the target feature;
if the dimension is different from a preset feature dimension input into the preset fusion model, performing dimension splitting on the target feature according to the preset feature dimension;
the inputting the target feature into a preset fusion model comprises:
and respectively inputting the target characteristics after dimension splitting into the preset fusion model.
In a possible implementation manner, the obtaining the target fusion feature of the target to be detected according to the output of the preset fusion model includes:
acquiring fusion characteristics corresponding to the dimension-split target characteristics output by the preset fusion model;
and carrying out dimension combination on the obtained fusion features to obtain the target fusion features.
In a possible implementation manner, the performing dimension splitting on the target feature according to the preset feature dimension includes:
determining the number of rows and columns of the preset feature dimension;
splitting the line number of the target feature according to the line number of the preset feature dimension, so that the line number of the split target feature is equal to the line number of the preset feature dimension;
and/or splitting the column number of the target feature according to the column number of the preset feature dimension, so that the column number of the split target feature is equal to the column number of the preset feature dimension.
In a possible implementation manner, before the inputting the target feature into the preset fusion model, the method further includes:
inputting the reference features into the preset fusion model;
determining fusion accuracy according to the fusion features output by the preset fusion model and the reference fusion features;
if the fusion accuracy is lower than a preset accuracy threshold, adjusting the preset fusion model according to the fusion accuracy to improve the fusion accuracy, taking the adjusted preset fusion model as a new preset fusion model, and re-executing the step of inputting the reference features into the preset fusion model.
In a possible implementation manner, the acquiring target features of the target to be detected includes:
inputting a target image of the target to be detected into a first preset model, and inputting target voice of the target to be detected into a second preset model, wherein the first preset model is obtained by training reference images and reference image characteristics, and the second preset model is obtained by training reference voice and reference voice characteristics;
and acquiring the target image characteristics output by the first preset model and the target voice characteristics output by the second preset model.
In one possible implementation, the target features further include target text features; the method for acquiring the target characteristics of the target to be detected comprises the following steps:
inputting a target text of the target to be detected into a third preset model, wherein the third preset model is obtained through training of reference texts and reference text characteristics;
and acquiring the target text features output by the third preset model.
In a possible implementation manner, after obtaining the target fusion feature of the target to be detected according to the output of the preset fusion model, the method further includes:
and detecting the target to be detected according to the target fusion characteristics.
In a possible implementation manner, the detecting the target to be detected according to the target fusion feature includes:
inputting the target fusion characteristics into a fourth preset model, wherein the fourth preset model is obtained by reference fusion characteristics and reference state training;
and acquiring the target state of the target to be detected output by the fourth preset model.
In a possible implementation manner, before the inputting the target image of the target to be detected into the first preset model and the target voice of the target to be detected into the second preset model, the method further includes:
and receiving the target image sent by a preset image acquisition device.
In a possible implementation manner, before the inputting the target image of the target to be detected into the first preset model and the target voice of the target to be detected into the second preset model, the method further includes:
and receiving the target image sent by the terminal equipment of the target to be detected.
In a possible implementation manner, before the inputting the target image of the target to be detected into the first preset model and the target voice of the target to be detected into the second preset model, the method further includes:
and receiving the target voice sent by a preset voice acquisition device.
In a possible implementation manner, before the inputting the target image of the target to be detected into the first preset model and the target voice of the target to be detected into the second preset model, the method further includes:
and receiving the target voice sent by the terminal equipment of the target to be detected.
In a possible implementation manner, before the inputting the target text of the target to be detected into the third preset model, the method further includes:
and receiving the voice of the target to be detected sent by a preset voice acquisition device, and converting the received voice into the target text.
In a possible implementation manner, before the inputting the target text of the target to be detected into the third preset model, the method further includes:
and receiving the voice of the target to be detected sent by the terminal equipment of the target to be detected, and converting the received voice into the target text.
In a possible implementation manner, before the inputting the target text of the target to be detected into the third preset model, the method further includes:
and receiving the target text sent by the terminal equipment of the target to be detected.
In a second aspect, an embodiment of the present application provides a feature fusion apparatus, including:
the characteristic acquisition module is used for acquiring target characteristics of a target to be detected, and the target characteristics comprise target image characteristics and target voice characteristics;
the feature input module is used for inputting the target features into a preset fusion model, wherein the preset fusion model is obtained through reference feature and reference fusion feature training, and the reference features comprise reference image features and reference voice features;
and the fusion characteristic obtaining module is used for obtaining the target fusion characteristics of the target to be detected according to the output of the preset fusion model.
In one possible implementation manner, the feature input module is further configured to:
determining dimensions of the target feature;
if the dimension is different from a preset feature dimension input into the preset fusion model, performing dimension splitting on the target feature according to the preset feature dimension;
the feature input module is specifically configured to:
and respectively inputting the target characteristics after dimension splitting into the preset fusion model.
In a possible implementation manner, the fused feature obtaining module is specifically configured to:
acquiring fusion characteristics corresponding to the dimension-split target characteristics output by the preset fusion model;
and carrying out dimension combination on the obtained fusion features to obtain the target fusion features.
In a possible implementation manner, the feature input module is specifically configured to:
determining the number of rows and columns of the preset feature dimension;
splitting the line number of the target feature according to the line number of the preset feature dimension, so that the line number of the split target feature is equal to the line number of the preset feature dimension;
and/or splitting the column number of the target feature according to the column number of the preset feature dimension, so that the column number of the split target feature is equal to the column number of the preset feature dimension.
In one possible implementation manner, the feature input module is further configured to:
inputting the reference features into the preset fusion model;
determining fusion accuracy according to the fusion features output by the preset fusion model and the reference fusion features;
if the fusion accuracy is lower than a preset accuracy threshold, adjusting the preset fusion model according to the fusion accuracy to improve the fusion accuracy, taking the adjusted preset fusion model as a new preset fusion model, and re-executing the step of inputting the reference features into the preset fusion model.
In a possible implementation manner, the feature obtaining module is specifically configured to:
inputting a target image of the target to be detected into a first preset model, and inputting target voice of the target to be detected into a second preset model, wherein the first preset model is obtained by training reference images and reference image characteristics, and the second preset model is obtained by training reference voice and reference voice characteristics;
and acquiring the target image characteristics output by the first preset model and the target voice characteristics output by the second preset model.
In one possible implementation, the target features further include target text features; the feature obtaining module is further configured to:
inputting a target text of the target to be detected into a third preset model, wherein the third preset model is obtained through training of reference texts and reference text characteristics;
and acquiring the target text features output by the third preset model.
In a possible implementation manner, the fused feature obtaining module is further configured to:
and detecting the target to be detected according to the target fusion characteristics.
In a possible implementation manner, the fused feature obtaining module is specifically configured to:
inputting the target fusion characteristics into a fourth preset model, wherein the fourth preset model is obtained by reference fusion characteristics and reference state training;
and acquiring the target state of the target to be detected output by the fourth preset model.
In a third aspect, an embodiment of the present application provides a feature fusion device, including:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program causes a server to execute the method according to the first aspect.
According to the feature fusion method, device and storage medium provided by the embodiments of the present application, the target image features and target voice features of a target to be detected are obtained, and a fusion model then fuses these features into a fused feature of the target, which is used to detect the target. Because both the image features and the voice features of the target are considered during detection, the method overcomes the high error rate of existing target detection that relies only on images captured by a terminal device. Moreover, fusing the image features with the voice features makes the fused feature richer in information, which further improves the accuracy of the subsequent detection result.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a feature fusion system architecture provided in an embodiment of the present application;
Fig. 2 is a schematic flowchart of a feature fusion method provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of a preset fusion model provided in an embodiment of the present application;
Fig. 4 is a schematic diagram of a process for feature fusion using a preset fusion model provided in an embodiment of the present application;
Fig. 5 is a schematic diagram of a training process of a preset fusion model provided in an embodiment of the present application;
Fig. 6 is a schematic flowchart of another feature fusion method provided in an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a feature fusion apparatus provided in an embodiment of the present application;
Fig. 8 is a schematic diagram of a possible structure of the feature fusion device of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," if any, in the description and claims of this application and the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
When users travel by commercial vehicles such as taxis and ride-hailing cars, certain problems can arise. Taking ride-hailing as an example, a conflict may occur between the driver and a passenger while the vehicle is moving, which may cause the driver to drive the vehicle in an abnormal state, so that passengers' daily ride-hailing trips carry certain potential safety hazards. To address this problem, the related art typically has a terminal device capture images inside the vehicle and performs target detection based on the captured images. For example, the terminal device is the driver's mobile phone, which captures in-vehicle images while the vehicle is moving; the driver and passengers in the vehicle are then detected based on the captured images, for example, to determine whether a conflict has occurred between the driver and a passenger.
However, the images captured by the terminal device carry large errors. For example, the camera of the terminal device may be partially or completely blocked during capture, so the captured image is incomplete or no image can be captured at all; or the light may be poor, for example at night, so the captured image is blurred; or, as the terminal device operates, its performance may age or components may be damaged, so the captured images are of poor quality. The error rate of target detection based on such images is therefore high, potential safety hazards cannot be discovered in time, and problems that arise while the vehicle is being driven cannot be correctly intervened in.
Therefore, an embodiment of the present application provides a feature fusion method that fuses multiple features of a target to be detected, for example, the target image features and target voice features of the target, into a fused feature, and then uses the fused feature to detect the target, for example, to determine whether a conflict has occurred between the driver and a passenger. Because multiple features of the target are obtained and used for the subsequent detection, the method overcomes the high error rate of existing target detection based on images captured by a terminal device; moreover, the fused feature carries richer information, which further improves the accuracy of the subsequent detection result.
In the embodiments of the present application, the target image features may be obtained by performing feature extraction on a target image of the target to be detected. For example, the target image may be an image of the driver and passengers in a ride-hailing car, captured by a preset image capture device mounted on the vehicle, such as a camera. Taking a camera as the preset image capture device, the number, positions, types, and other details of the cameras may be specified uniformly by a ride-hailing management server. The ride-hailing management server manages each ride-hailing car, for example, verifying the qualifications of the car and its driver and monitoring the car's order taking. Based on its camera specifications, the server can check whether the cameras installed in a car meet the requirements, for example, from installation photos uploaded by the driver. If the requirements are met, the server can connect to the cameras in the car; after determining that the car has accepted an order, the server can turn the cameras on to capture images of the driver and passengers in the car, and stop capturing once the order is completed.
Alternatively, the target image may be captured by a terminal device of the target to be detected, for example, the driver's mobile phone. For instance, after verifying the qualifications of the ride-hailing car and its driver, the ride-hailing management server may, if the verification passes, establish a connection with the driver's mobile phone and monitor the car's order taking through the connection. After determining that the car has accepted an order, the server may send a camera-activation prompt to the driver's mobile phone. The driver then turns on the phone camera according to the prompt and captures images of the driver and passengers in the car. After turning the camera on, the driver can feed camera-activation information back to the server, so that the server knows the state of the driver's phone in time.
The target image may also be captured by other devices, such as a dashboard camera or cameras along the driving route. For example, the ride-hailing management server may connect to the dashboard camera in the car and, after determining that the car has accepted an order, capture images of the driver and passengers through it. For ride-hailing cars with a fixed route, the server may capture images of the driver and passengers through cameras along that route.
Likewise, in the embodiments of the present application, the target voice features may be obtained by performing feature extraction on target voice of the target to be detected. For example, the target voice may be the voices of the driver and passengers in the ride-hailing car, collected by a voice capture device mounted on the vehicle. The number, positions, types, and other details of the voice capture devices may be specified uniformly by the ride-hailing management server, which can check whether the devices installed in a car meet the requirements. If so, the server can connect to the devices; after determining that the car has accepted an order, the server can turn the devices on to collect the voices of the driver and passengers, and stop collecting once the order is completed.
The target voice may also be collected by a terminal device of the target to be detected, for example, the driver's mobile phone, or by other devices such as a dashboard camera. For details, refer to the description above of capturing the target image through the driver's mobile phone or a dashboard camera, which is not repeated here.
In some feasible embodiments, the multiple features of the target to be detected may further include target text features, which may be obtained by performing feature extraction on target text of the target. For example, the target text may be text information entered by the driver and passengers in the ride-hailing car, obtained through a terminal device of the target, for example, a passenger's or the driver's mobile phone. Taking text obtained through a passenger's phone as an example, the ride-hailing management server may establish a connection with the phone after the passenger places an order; during the trip, the passenger can type text on the phone and send it to the server, which thereby obtains the passenger's text directly. Taking text obtained through the driver's phone as an example, after the qualifications of the car and driver pass verification, the server may connect to the driver's phone; after determining that the car has accepted an order, the server may send a recording-activation prompt to the phone. The driver then turns on the phone's recording function according to the prompt, collects the voices of the driver and passengers, and sends the recordings to the server, which converts the received voice into text and thereby obtains the driver's text indirectly. Similarly, the driver can type text on the phone and send it to the server, which obtains the driver's text directly.
In addition, the target text may be obtained through a voice capture device. After determining that the ride-hailing car has accepted an order, the server can turn on the voice capture device in the car to collect the voices of the driver and passengers, stopping once the order is completed. The device sends the collected voice to the server, which converts it into text and thereby obtains the relevant text indirectly.
After the target image, target voice, and target text of the target to be detected are obtained, the embodiments of the present application can perform feature extraction on the target image and the target voice to obtain the target image features and target voice features, or on the target text to obtain the target text features. These features are then fused into a fused feature, and the fused feature is used to detect the target, for example, to determine whether a conflict has occurred between the driver and a passenger.
The feature extraction, feature fusion, and the like may be performed by the ride-hailing management server. Suppose the target image is captured by a preset image capture device mounted on the car, the target voice is collected by a preset voice capture device mounted on the car, and the target text is obtained through the voice capture device. After capturing the target image, the preset image capture device sends it to the server; after collecting the target voice, the preset voice capture device sends it to the server; the voice capture device also collects the voices of the driver and passengers and sends them to the server to be converted into the target text. The server extracts features from the target image and target voice to obtain the target image features and target voice features, or from the target text to obtain the target text features, and then fuses multiple features of the target to be detected, such as the image features and the voice features, into a fused feature, which is used to detect the target, for example, to determine whether a conflict has occurred between the driver and a passenger. If a conflict has occurred, the server can also send the result to relevant personnel, so that potential safety hazards are discovered in time and problems arising while the vehicle is being driven are correctly intervened in.
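As an illustration only, the pipeline just described can be sketched minimally in Python as below. All function names, the 10-dimensional feature size, and the stand-in models are assumptions for this sketch; the patent does not fix concrete architectures or APIs.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10))  # stands in for the fusion model's learned weights

def first_preset_model(image: np.ndarray) -> np.ndarray:
    """Stand-in for the image feature extractor (hypothetical)."""
    return rng.standard_normal(10)          # 10 target image features

def second_preset_model(voice: np.ndarray) -> np.ndarray:
    """Stand-in for the voice feature extractor (hypothetical)."""
    return rng.standard_normal(10)          # 10 target voice features

def preset_fusion_model(feature: np.ndarray) -> np.ndarray:
    """Stand-in for the preset fusion model (hypothetical)."""
    return W @ feature

def fourth_preset_model(fused: np.ndarray) -> str:
    """Stand-in for the state classifier (hypothetical)."""
    return "conflict" if fused.sum() > 0 else "no conflict"

# Server-side flow: extract features, fuse piece by piece, detect.
image, voice = np.zeros((224, 224, 3)), np.zeros(16000)
target = np.stack([first_preset_model(image),
                   second_preset_model(voice)], axis=1)   # 10x2 target features
fused = np.concatenate([preset_fusion_model(target[:, c])
                        for c in range(target.shape[1])])
print(fourth_preset_model(fused))
```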
Optionally, Fig. 1 is a schematic diagram of a feature fusion system architecture provided in an embodiment of the present application. In Fig. 1, the targets to be detected are the driver and a passenger in a ride-hailing car. The architecture includes a ride-hailing management server 11, a preset image capture device 12, and a preset voice capture device 13.
It is to be understood that the structure illustrated in the embodiments of the present application does not specifically limit the feature fusion architecture. In other possible embodiments of the present application, the architecture may include more or fewer components than shown, combine some components, split some components, or arrange the components differently, which may be determined according to the practical application scenario and is not limited here. The components shown in Fig. 1 may be implemented in hardware, software, or a combination of software and hardware.
In a specific implementation, the ride-hailing management server 11 may first determine whether to activate the preset image capture device 12 and the preset voice capture device 13 in the car. For example, taking the image capture device 12: after determining that the car has accepted an order, the server 11 may turn the device on to capture images of the driver and passengers, stopping once the order is completed. Alternatively, the server 11 may decide whether to activate the device 12 according to the driver's order-taking ratings. For example, if the driver's ratings are poor, say more than two complaints in the last month, the server 11 turns the device 12 on to capture images of the driver and passengers until the order is completed. Similarly, the server 11 may control the preset voice capture device 13 in the same manner to collect the voices of the driver and passengers.
After the preset image capture device 12 is activated and captures images of the driver and passengers, it sends the captured images to the ride-hailing management server 11. Similarly, after the preset voice capture device 13 is activated and collects their voices, it sends the collected voices to the server 11. Upon receiving the images and voices, the server 11 can extract image features from the images and voice features from the voices. The server 11 can also convert the voices into text and extract text features from it. The server 11 then fuses multiple features of the driver and passengers, such as the image features, the voice features, and the text features, into a fused feature, and detects the driver and passengers with it, for example, determining whether a conflict has occurred between them. Obtaining multiple features of the driver and passengers for the subsequent target detection overcomes the high error rate of existing target detection based on images captured by a terminal device; moreover, the fused feature carries richer information, which further improves the accuracy of the subsequent detection result.
In addition, when there are many ride-hailing cars, the volume of data received by the ride-hailing management server is large and substantial computing resources are required, which may make target detection through the server slow. To address this, in the embodiments of the present application the feature extraction, feature fusion, and the like may be performed by the terminal device of the target to be detected, for example, the driver's mobile phone. For instance, the preset image capture device 12 may send the captured images of the driver and passengers to the driver's phone, and the preset voice capture device 13 may send the collected voices to the phone as well. On the phone, image features are extracted from the images and voice features from the voices; the voices can also be converted into text on the phone, and text features extracted from it. The phone then fuses multiple features of the driver and passengers, such as the image features, the voice features, and the text features, into a fused feature and sends it to the server 11, which uses the fused feature to detect the target, for example, determining whether a conflict has occurred between the driver and a passenger.
Here, performing target detection with the terminal device and the ride-hailing management server together makes full use of the terminal device's computing capability, reduces the computing load on the server, and increases the server's processing speed.
If the terminal device, for example the driver's mobile phone, cannot perform the feature extraction, feature fusion, or the like, it may send a processing request to the server 11. Upon receiving the request, the server 11 may send information requests to the preset image capture device 12 and the preset voice capture device 13, so that the device 12 sends its captured images and the device 13 sends its collected voices to the server 11. The server 11 then performs the feature extraction, feature fusion, and the like on the received information and detects the driver and passengers based on the fused feature.
The technical solutions of the present application are described below with several embodiments as examples, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 2 is a schematic flowchart of a feature fusion method provided in an embodiment of the present application. The method may be applied to feature fusion processing and may be executed by any device capable of performing it, where the device may be implemented by software and/or hardware. As shown in Fig. 2, based on the system architecture shown in Fig. 1, the feature fusion method provided in the embodiment of the present application includes the following steps:
s201: and acquiring target characteristics of the target to be detected, wherein the target characteristics comprise target image characteristics and target voice characteristics.
The target to be detected may be determined according to the actual situation; for example, to determine whether a conflict occurs between the driver and a passenger while the vehicle is moving, the targets to be detected may be the driver and the passengers in the vehicle.
The number of the target image features can be one or more, and can be determined according to actual conditions. Similarly, the number of the target speech features may also be one or more, and the embodiment of the present application does not particularly limit this.
S202: inputting the target features into a preset fusion model, where the preset fusion model is trained with reference features and reference fusion features, and the reference features include reference image features and reference voice features.
Here, the reference fusion feature can be understood as the real, ground-truth fused feature. Where reference fusion features appear in the subsequent description, this explanation applies and is not repeated.
The preset fusion model may employ a fully connected layer of dimension A x B, for example, the 10x1 fully connected layer shown in Fig. 3. In order to match the dimension of the features input into the preset fusion model with the model's dimension, in the embodiments of the present application, before the target features are input into the preset fusion model, the method may further include: determining the dimension of the target features; if the dimension differs from the preset feature dimension of the model's input, performing dimension splitting on the target features according to the preset feature dimension; and then inputting the dimension-split target features into the preset fusion model.
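A minimal sketch of such a fusion model follows, assuming the 10x1 fully connected layer from the example maps a 10-element input vector through an `nn.Linear`; the output size of 10 is an assumption, since the patent does not state the output dimension.

```python
import torch
import torch.nn as nn

# Preset fusion model as a single fully connected layer. The input
# size of 10 matches the 10x1 preset feature dimension in the example;
# the output size of 10 is an assumption.
fusion_model = nn.Linear(in_features=10, out_features=10)

x = torch.randn(10)       # one dimension-split target feature (10x1)
fused = fusion_model(x)   # fused feature for this piece
```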
Here, take the ride-hailing management server as the execution subject. Performing dimension splitting on the target features according to the preset feature dimension may include: the server determines the number of rows and columns of the preset feature dimension; for example, if the preset fusion model employs a fully connected layer of dimension 10x1, the preset feature dimension of its input is 10x1, with 10 rows and 1 column. If there are 10 target image features and 10 target voice features, the dimension of the target features is 10x2, which differs from the preset feature dimension 10x1, so the target features must be split. For example, the server splits the rows of the target features according to the rows of the preset feature dimension, so that the split target features have as many rows as the preset feature dimension; and/or the server splits the columns of the target features according to the columns of the preset feature dimension, so that the split target features have as many columns as the preset feature dimension. Here, the target feature dimension 10x2 has 10 rows and 2 columns, while the preset feature dimension 10x1 has 10 rows and 1 column. Since the row counts already match, the server splits only the columns: the 10x2 target features are split into two 10x1 pieces. Each split piece now has the same dimension as the preset input, and the server can input the dimension-split target features into the preset fusion model.
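The column split described above can be sketched as follows; the helper name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def split_to_preset_dim(target: np.ndarray, rows: int = 10, cols: int = 1):
    """Split a feature matrix so every piece is rows x cols."""
    R, C = target.shape
    assert R % rows == 0 and C % cols == 0, "dimensions must divide evenly"
    return [target[r:r + rows, c:c + cols]
            for r in range(0, R, rows)
            for c in range(0, C, cols)]

# 10 image features and 10 voice features -> a 10x2 target feature.
target = np.random.randn(10, 2)
pieces = split_to_preset_dim(target)   # two 10x1 pieces, as in the text
```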
S203: obtaining the target fusion feature of the target to be detected according to the output of the preset fusion model.
Here, if the server performed dimension splitting on the target features before inputting them into the preset fusion model and input the split pieces separately, then obtaining the target fusion feature of the target to be detected according to the model's output may include: obtaining the fused features that the preset fusion model outputs for the dimension-split pieces, and combining their dimensions to obtain the target fusion feature.
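Conversely, the dimension combination can be sketched as concatenating the per-piece outputs back along the split axis; this is an assumption, since the patent only says the fused features are dimension-combined.

```python
import numpy as np

# Fused outputs for the two 10x1 pieces from the previous sketch.
fused_pieces = [np.random.randn(10, 1), np.random.randn(10, 1)]

# Undo the column split to recover the target fusion feature (10x2).
target_fusion_feature = np.concatenate(fused_pieces, axis=1)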
For example, as shown in Fig. 4, the process by which the server performs feature fusion with the preset fusion model may be: first, the server obtains the target features of the target to be detected, inputs them into the preset fusion model, and thereby obtains the target fusion feature of the target.
After obtaining the target fusion feature of the target to be detected, the server may detect the target according to it, for example, determining whether a conflict has occurred between the driver and a passenger. For example, the server may input the target fusion feature into a fourth preset model trained with reference fusion features and reference states, and obtain from its output the target state of the target to be detected, thereby completing the detection.
In the embodiments of the present application, the ride-hailing management server first obtains the target image features and target voice features of the target to be detected, then fuses them with the fusion model into the target's fused feature, and uses the fused feature to detect the target, for example, determining whether a conflict has occurred between the driver and a passenger. Because both the image features and the voice features of the target are considered during detection, the method overcomes the high error rate of existing target detection based on images captured by a terminal device; moreover, fusing the image features with the voice features makes the fused feature richer in information, which further improves the accuracy of the subsequent detection result.
Here, before inputting the target features into the preset fusion model, the server needs to train the model, so that the target features can subsequently be input into the trained model and the target fusion feature obtained from its output. During training, the server may input the reference features into the preset fusion model and then determine the fusion accuracy from the fused features output by the model and the reference fusion features. If the fusion accuracy is lower than the preset accuracy threshold, the server may adjust the model according to the fusion accuracy to improve it, take the adjusted model as the new preset fusion model, and re-execute the step of inputting the reference features into the model.
The reference features include reference image features and reference voice features. The reference image features may be obtained by performing feature extraction on reference images. Taking ride-hailing as an example, the reference images may be captured by a preset image capture device mounted on the car, such as a camera; by a terminal device of the driver or a passenger, for example the driver's mobile phone; or by a dashboard camera, cameras along the driving route, and the like. For the specific process, refer to the acquisition of the target image described above, which is not repeated here. Likewise, the reference voice features may be obtained by performing feature extraction on reference voice, whose acquisition may refer to that of the target voice and is also not repeated here.
The reference fusion feature may be obtained by the server as a weighted average of the reference features. For example, if the reference features include multiple reference image features and multiple reference voice features, the server may compute the weighted average of the reference image features and the weighted average of the reference voice features separately, and then take a weighted average of those two averages to obtain the reference fusion feature. The weights in the weighted averaging may be determined according to the actual situation.
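A sketch of this weighted averaging follows, with uniform weights as a placeholder assumption.

```python
import numpy as np

def weighted_mean(features, weights):
    """Weighted average of equally shaped feature vectors."""
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), np.stack(features), axes=1)

image_feats = [np.random.randn(10) for _ in range(5)]   # reference image features
voice_feats = [np.random.randn(10) for _ in range(5)]   # reference voice features

# Average each modality separately, then average the two results.
img_avg = weighted_mean(image_feats, [1.0] * len(image_feats))
voc_avg = weighted_mean(voice_feats, [1.0] * len(voice_feats))
reference_fusion = weighted_mean([img_avg, voc_avg], [0.5, 0.5])
```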
After inputting the reference features into the preset fusion model, the ride-hailing management server obtains the fusion features output by the preset fusion model and compares them with the reference fusion features; illustratively, the feature similarity between the two is computed. The ride-hailing management server then determines the fusion accuracy according to the feature similarity obtained by the comparison. If the fusion accuracy is lower than the preset accuracy threshold, the feature fusion effect of the preset fusion model is poor, and the preset fusion model needs to be adjusted to improve the fusion accuracy; the adjusted preset fusion model is taken as a new preset fusion model, the above steps are executed again, and training stops once the determined fusion accuracy is greater than or equal to the preset accuracy threshold. The preset accuracy threshold may be set according to actual conditions, for example 90% or 95%, which is not particularly limited in the embodiments of the present application. For example, the training process of the preset fusion model may be as shown in fig. 5: the ride-hailing management server first obtains training samples that include the reference features, then trains the preset fusion model with the training samples, that is, inputs the reference features into the preset fusion model, determines the fusion accuracy according to the fusion features output by the preset fusion model and the reference fusion features, and stops training when the fusion accuracy is greater than or equal to the preset accuracy threshold, thereby obtaining the trained preset fusion model.
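For illustration, a minimal sketch of this training loop, assuming the fusion accuracy is measured as the mean cosine similarity between model outputs and reference fusion features, and assuming the model exposes hypothetical forward() and adjust() methods; none of these names or choices come from the embodiment.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def train_fusion_model(model, samples, threshold=0.95):
    """samples: list of (reference_feature, reference_fusion_feature) pairs."""
    while True:
        sims = [cosine_similarity(model.forward(ref), fused) for ref, fused in samples]
        accuracy = float(np.mean(sims))
        if accuracy >= threshold:
            return model            # fusion accuracy reached the preset threshold
        model.adjust(accuracy)      # adjust the model to improve fusion accuracy
```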
In addition, before inputting the target fusion features into the fourth preset model, the ride-hailing management server also needs to train the fourth preset model, so that the target fusion features can subsequently be input into the trained fourth preset model and the target state of the target to be detected obtained from its output, thereby realizing detection of the target. During training, the input of the fourth preset model is the reference fusion features and its output is the state of the driver and the passenger, for example, whether a conflict occurs between them. Illustratively, the ride-hailing management server inputs the reference fusion features into the fourth preset model and then determines whether the driver and passenger state output by the model is the same as the reference state; if not, the fourth preset model is adjusted so that its output matches the reference state. The reference state may be the real state of the driver and the passenger, such as conflict or no conflict.
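As a sketch only, assuming the fourth preset model is a binary classifier over fusion features with reference states encoded as labels 1 (conflict) and 0 (no conflict); the choice of logistic regression is an illustrative assumption, not the embodiment's model.

```python
from sklearn.linear_model import LogisticRegression

def train_state_model(reference_fusion_features, reference_states):
    """Fit a classifier so its output matches the reference states."""
    model = LogisticRegression(max_iter=1000)
    model.fit(reference_fusion_features, reference_states)
    return model

# usage sketch: state = train_state_model(X, y).predict([target_fusion_feature])
```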
In the embodiment of the present application, the target features may further include target text features in addition to the target image features and the target voice features, and the ride-hailing management server may use preset models to obtain the target features of the target to be detected. Fig. 6 is a schematic flowchart of another feature fusion method according to an embodiment of the present application. As shown in fig. 6, the method includes:
S601: inputting a target image of a target to be detected into a first preset model, wherein the first preset model is obtained by training with reference images and reference image features.
The execution body of this method is the ride-hailing management server. The ride-hailing management server may collect a plurality of images of the target to be detected. For example, taking the driver and the passenger in the vehicle as the target to be detected, the ride-hailing management server may collect a plurality of images of the driver and the passenger and then screen them to obtain an image meeting a preset image requirement as the target image. The preset image requirement may include: the image is not occluded, and the image sharpness reaches a preset sharpness threshold.
For example, when screening the collected images, the ride-hailing management server may use a preset image screening model: it inputs the collected images into the preset image screening model and obtains the target image output by that model. The image screening model is obtained by training with a plurality of reference images and reference target images.
By screening out, from the collected images, an image that is not occluded and whose sharpness reaches the preset sharpness threshold as the target image, the ride-hailing management server makes the target image features obtained from the target image more accurate, thereby improving the accuracy of subsequent processing results.
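As an illustration of the screening rule only (the embodiment itself uses a trained image screening model), a minimal sketch in which sharpness is measured by the variance of the Laplacian, a common heuristic; the threshold value and the is_occluded() helper are assumptions.

```python
import cv2

def is_occluded(img):
    # hypothetical stub: a real system would run an occlusion detector here
    return False

def screen_images(images, sharpness_threshold=100.0):
    """Keep images that are unoccluded and sharp enough to serve as target images."""
    targets = []
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # higher variance = sharper
        if sharpness >= sharpness_threshold and not is_occluded(img):
            targets.append(img)
    return targets
```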
S602: acquiring the target image features output by the first preset model.
S603: inputting the target voice of the target to be detected into a second preset model, wherein the second preset model is obtained by training with reference voices and reference voice features.
In the embodiment of the application, the ride-hailing management server may collect a plurality of voices of the target to be detected. For example, taking the driver and the passenger in the vehicle as the target to be detected, the ride-hailing management server may collect a plurality of voices of the driver and the passenger and then screen them to obtain a voice meeting a preset voice requirement as the target voice. The preset voice requirement may include: the voice delay is lower than a preset delay threshold, the voice jitter is lower than a preset jitter threshold, and the like.
For example, when screening the collected voices, the ride-hailing management server may use a preset voice screening model: it inputs the collected voices into the preset voice screening model and obtains the target voice output by that model. The voice screening model is obtained by training with a plurality of reference voices and reference target voices.
Here, by screening out a low-delay, low-jitter voice from the collected voices as the target voice, the ride-hailing management server ensures the accuracy of subsequent processing results.
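A minimal sketch of the delay/jitter rule, assuming each captured clip carries delay and jitter measurements in milliseconds; the field names and threshold values are illustrative assumptions, and the embodiment itself uses a trained voice screening model.

```python
def screen_voices(voices, delay_threshold_ms=200.0, jitter_threshold_ms=30.0):
    """Keep clips whose delay and jitter are below the preset thresholds."""
    return [
        v for v in voices
        if v["delay_ms"] < delay_threshold_ms and v["jitter_ms"] < jitter_threshold_ms
    ]
```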
S604: acquiring the target voice features output by the second preset model.
S605: inputting the target text of the target to be detected into a third preset model, wherein the third preset model is obtained by training with reference texts and reference text features.
The ride-hailing management server may obtain a plurality of texts of the target to be detected. For example, taking the driver and the passenger in the vehicle as the target to be detected, the ride-hailing management server may obtain a plurality of texts of the driver and the passenger and then screen them to obtain a text meeting a preset text requirement as the target text. The preset text requirement may include: carrying preset keywords, where the preset keywords may be determined from texts collected when conflicts occurred between drivers and passengers.
For example, when screening the obtained texts, the ride-hailing management server may use a preset text screening model: it inputs the obtained texts into the preset text screening model and obtains the target text output by that model. The text screening model is obtained by training with a plurality of reference texts and reference target texts.
By screening out a text carrying the preset keywords from the obtained texts as the target text, the ride-hailing management server improves the accuracy of subsequent target detection results.
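A minimal sketch of the keyword rule; the keyword list is a hypothetical example (the embodiment derives the preset keywords from texts collected during driver-passenger conflicts and, in practice, uses a trained text screening model).

```python
PRESET_KEYWORDS = ["help", "stop the car", "police"]  # hypothetical examples

def screen_texts(texts, keywords=PRESET_KEYWORDS):
    """Keep texts that carry at least one preset keyword."""
    return [t for t in texts if any(k in t for k in keywords)]
```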
S606: acquiring the target text features output by the third preset model.
S607: inputting the target image features, the target voice features and the target text features into a preset fusion model, wherein the preset fusion model is obtained by training with reference features and reference fusion features, and the reference features comprise reference image features, reference voice features and reference text features.
S608: obtaining the target fusion features of the target to be detected according to the output of the preset fusion model.
The steps S607 to S608 are the same as the steps S202 to S203, and are not described herein again.
In this embodiment, preset models are used to obtain the target features of the target to be detected, which is simple, fast and easy to apply. In addition, when detecting the target, not only the image features of the target to be detected but also its voice features are considered, that is, multiple kinds of features of the target are obtained for the subsequent detection, which solves the problem of the high error rate of target detection based only on images collected by terminal devices; and because the image features and the voice features are fused, the fused features contain richer information, further improving the accuracy of the subsequent detection result.
Here, before inputting the target image of the target to be detected into the first preset model, the ride-hailing management server needs to train the first preset model, so that the target image can subsequently be input into the trained first preset model and the target image features obtained from its output. In the training process, the ride-hailing management server may input a reference image into the first preset model and then determine the output accuracy according to the image features output by the first preset model and the reference image features. If the output accuracy is lower than a preset accuracy threshold, the ride-hailing management server may adjust the first preset model according to the output accuracy so as to improve it, take the adjusted first preset model as a new first preset model, and re-execute the step of inputting the reference image into the first preset model.
The reference image features may be obtained by performing feature extraction on the reference images. Illustratively, the ride-hailing management server may extract features from a reference image using a technique such as the histogram of oriented gradients, thereby obtaining the reference image features.
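For illustration, a minimal sketch of histogram-of-oriented-gradients extraction using scikit-image; the HOG parameters shown are common defaults chosen here, not values from the embodiment.

```python
from skimage.color import rgb2gray
from skimage.feature import hog

def extract_image_feature(image):
    """Return a HOG feature vector for one reference image (RGB array)."""
    gray = rgb2gray(image)
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
```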
After inputting the reference image into the first preset model, the ride-hailing management server obtains the image features output by the first preset model and compares them with the reference image features; illustratively, the feature similarity between the two is computed. The ride-hailing management server then determines the output accuracy according to the feature similarity obtained by the comparison. If the output accuracy is lower than the preset accuracy threshold, the feature extraction effect of the first preset model is poor, and the first preset model needs to be adjusted to improve the output accuracy; the adjusted first preset model is taken as a new first preset model, the above steps are executed again, and training stops once the determined output accuracy is greater than or equal to the preset accuracy threshold.
Similarly, before inputting the target voice of the target to be detected into the second preset model, the ride-hailing management server needs to train the second preset model, and before inputting the target text of the target to be detected into the third preset model, it needs to train the third preset model. The training processes of the second and third preset models may refer to the training process of the first preset model and are not described herein again.
The reference voice features may be obtained by performing feature extraction on the reference voices; illustratively, the ride-hailing management server may extract features from a reference voice using a technique such as the discrete wavelet transform. The reference text features may be obtained by performing feature extraction on the reference texts; the ride-hailing management server may extract features from a reference text using the TF-IDF algorithm or the like.
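Minimal sketches of the two named techniques, using PyWavelets and scikit-learn; the wavelet family, decomposition level, and vectorizer settings are illustrative assumptions.

```python
import numpy as np
import pywt
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_voice_feature(signal, wavelet="db4", level=4):
    """Discrete wavelet transform of a 1-D audio signal, flattened to a vector."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.concatenate([c.ravel() for c in coeffs])

def extract_text_features(reference_texts):
    """TF-IDF feature matrix for a list of reference texts."""
    return TfidfVectorizer().fit_transform(reference_texts)
```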
In addition, since the ride-hailing management server may use a preset image screening model when screening the collected images, it needs to train the preset image screening model before using it. In the training process, the ride-hailing management server may input a plurality of reference images into the preset image screening model and then determine whether the image output by the model is the same as the reference target image; if not, the ride-hailing management server may adjust the preset image screening model so that the image it outputs is the reference target image. The reference target image may be the image, among the plurality of reference images, that is not occluded and whose sharpness reaches the preset sharpness threshold.
Similarly, the ride-hailing management server needs to train the preset voice screening model before using it to screen the collected voices, and needs to train the preset text screening model before using it to screen the obtained texts. The training processes of the preset voice screening model and the preset text screening model may refer to the training process of the preset image screening model and are not described herein again.
The reference target voice may be the low-delay, low-jitter voice among the plurality of reference voices, and the reference target text may be the text carrying the preset keywords among the plurality of reference texts.
Corresponding to the feature fusion method in the foregoing embodiments, fig. 7 is a schematic structural diagram of a feature fusion apparatus provided in an embodiment of the present application; for convenience of explanation, only the portions related to the embodiments of the present application are shown. The feature fusion apparatus 70 includes: a feature obtaining module 701, a feature input module 702, and a fusion feature obtaining module 703. The feature fusion apparatus may be the processing device itself, or a chip or an integrated circuit that implements the functions of the processing device. It should be noted that the division into the feature obtaining module, the feature input module and the fusion feature obtaining module is only a division of logical functions; physically, they may be integrated or independent.
The feature obtaining module 701 is configured to obtain a target feature of a target to be detected, where the target feature includes a target image feature and a target voice feature.
A feature input module 702, configured to input the target feature into a preset fusion model, where the preset fusion model is obtained through training of reference features and reference fusion features, and the reference features include reference image features and reference voice features.
A fusion feature obtaining module 703, configured to obtain a target fusion feature of the target to be detected according to the output of the preset fusion model.
In one possible design, the feature input module 702 is further configured to:
determining dimensions of the target feature;
and if the dimension is different from the preset feature dimension input into the preset fusion model, performing dimension splitting on the target feature according to the preset feature dimension.
The feature input module 702 is specifically configured to:
and respectively inputting the target characteristics after dimension splitting into the preset fusion model.
In one possible design, the fusion feature obtaining module 703 is specifically configured to:
acquiring fusion characteristics corresponding to the dimension-split target characteristics output by the preset fusion model;
and carrying out dimension combination on the obtained fusion features to obtain the target fusion features.
In one possible design, the feature input module 702 is specifically configured to:
determining the number of rows and columns of the preset feature dimension;
splitting the line number of the target feature according to the line number of the preset feature dimension, so that the line number of the split target feature is equal to the line number of the preset feature dimension;
and splitting the column number of the target feature according to the column number of the preset feature dimension, so that the column number of the split target feature is equal to the column number of the preset feature dimension (see the sketch after this design).
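A minimal sketch of the row/column splitting just described, together with the dimension combination performed by the fusion feature obtaining module, assuming features are 2-D numpy arrays whose sizes are exact multiples of the preset feature dimension (an assumption made for illustration).

```python
import numpy as np

def split_to_preset_dimension(feature, preset_rows, preset_cols):
    """Split a feature into blocks matching the preset feature dimension."""
    rows, cols = feature.shape
    return [feature[r:r + preset_rows, c:c + preset_cols]
            for r in range(0, rows, preset_rows)
            for c in range(0, cols, preset_cols)]

def merge_from_preset_dimension(blocks, rows, cols, preset_rows, preset_cols):
    """Dimension combination: reassemble fused blocks into the full feature."""
    merged = np.zeros((rows, cols))
    it = iter(blocks)
    for r in range(0, rows, preset_rows):
        for c in range(0, cols, preset_cols):
            merged[r:r + preset_rows, c:c + preset_cols] = next(it)
    return merged
```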
In one possible design, the feature input module 702 is further configured to:
inputting the reference features into the preset fusion model;
determining fusion accuracy according to the fusion features output by the preset fusion model and the reference fusion features;
if the fusion accuracy is lower than a preset accuracy threshold, adjusting the preset fusion model according to the fusion accuracy to improve the fusion accuracy, taking the adjusted preset fusion model as a new preset fusion model, and re-executing the step of inputting the reference features into the preset fusion model.
In one possible design, the feature obtaining module 701 is specifically configured to:
inputting a target image of the target to be detected into a first preset model, and inputting target voice of the target to be detected into a second preset model, wherein the first preset model is obtained by training with reference images and reference image features, and the second preset model is obtained by training with reference voices and reference voice features;
and acquiring the target image characteristics output by the first preset model and the target voice characteristics output by the second preset model.
In one possible design, the target feature further includes a target text feature.
The feature obtaining module 701 is further configured to:
inputting a target text of the target to be detected into a third preset model, wherein the third preset model is obtained through training of reference texts and reference text characteristics;
and acquiring the target text features output by the third preset model.
In one possible design, the fusion feature obtaining module 703 is further configured to:
and detecting the target to be detected according to the target fusion characteristics.
In one possible design, the fusion feature obtaining module 703 is specifically configured to:
inputting the target fusion features into a fourth preset model, wherein the fourth preset model is obtained by training with reference fusion features and reference states;
and acquiring the target state of the target to be detected output by the fourth preset model.
In one possible design, the feature obtaining module is further configured to:
and receiving the target image sent by a preset image acquisition device.
In one possible design, the feature obtaining module is further configured to:
and receiving the target image sent by the terminal equipment of the target to be detected.
In one possible design, the feature obtaining module is further configured to:
and receiving the target voice sent by a preset voice acquisition device.
In one possible design, the feature obtaining module is further configured to:
and receiving the target voice sent by the terminal equipment of the target to be detected.
In one possible design, the feature obtaining module is further configured to:
and receiving the voice of the target to be detected sent by a preset voice acquisition device, and converting the received voice into the target text.
In one possible design, the feature obtaining module is further configured to:
and receiving the voice of the target to be detected sent by the terminal equipment of the target to be detected, and converting the received voice into the target text.
In one possible design, the feature obtaining module is further configured to:
and receiving the target text sent by the terminal equipment of the target to be detected.
The apparatus provided in the embodiment of the present application may be configured to implement the technical solution of the method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again in the embodiment of the present application.
Optionally, fig. 8 schematically provides a possible basic hardware architecture of the feature fusion apparatus described in the present application.
Referring to fig. 8, a feature fusion device 800 includes at least one processor 801 and a communication interface 803. Further optionally, a memory 802 and a bus 804 may also be included.
The feature fusion apparatus 800 may be the processing device, and the present application is not limited thereto. In the feature fusion apparatus 800, the number of processors 801 may be one or more, and fig. 8 illustrates only one processor 801. Optionally, the processor 801 may be a central processing unit (CPU), a graphics processing unit (GPU), or a digital signal processor (DSP). If the feature fusion apparatus 800 has multiple processors 801, their types may be the same or different. Optionally, the multiple processors 801 of the feature fusion apparatus 800 may also be integrated into a multi-core processor.
Memory 802 stores computer instructions and data; the memory 802 may store computer instructions and data necessary to implement the above-described feature fusion methods provided herein, e.g., the memory 802 stores instructions for implementing the steps of the above-described feature fusion methods. The memory 802 may be any one or any combination of the following storage media: nonvolatile memory (e.g., Read Only Memory (ROM), Solid State Disk (SSD), hard disk (HDD), optical disk), volatile memory.
The communication interface 803 may provide information input/output for the at least one processor. Any one or any combination of the following devices may also be included: a network interface (e.g., an ethernet interface), a wireless network card, etc. having a network access function.
Optionally, the communication interface 803 may also be used for the feature fusion apparatus 800 to perform data communication with other computing apparatuses or terminals.
Further optionally, the bus 804, shown in fig. 8 as a thick line, may connect the processor 801 with the memory 802 and the communication interface 803. Thus, via the bus 804, the processor 801 may access the memory 802 and may also interact with other computing devices or terminals through the communication interface 803.
In the present application, the feature fusion apparatus 800 executes computer instructions in the memory 802, so that the feature fusion apparatus 800 implements the feature fusion method provided in the present application, or the feature fusion apparatus 800 deploys the feature fusion device.
In terms of logical function division, as shown in fig. 8, the memory 802 may include a feature obtaining module 701, a feature input module 702, and a fusion feature obtaining module 703. "Include" here merely means that the instructions stored in the memory, when executed, can implement the functions of the feature obtaining module, the feature input module, and the fusion feature obtaining module respectively; it does not imply a physical structure.
In addition, the feature fusion apparatus may be implemented in software, as shown in fig. 8, or in hardware as a hardware module or a circuit unit.
The present application provides a computer-readable storage medium storing computer instructions that instruct a computing device to perform the above-described feature fusion method provided herein.
The present application provides a chip comprising at least one processor and a communication interface providing information input and/or output for the at least one processor. Further, the chip may also include at least one memory for storing computer instructions. The at least one processor is used for calling and executing the computer instructions to execute the above feature fusion method provided by the application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Claims (19)

1. A method of feature fusion, comprising:
acquiring target characteristics of a target to be detected, wherein the target characteristics comprise target image characteristics and target voice characteristics;
inputting the target features into a preset fusion model, wherein the preset fusion model is obtained through reference feature and reference fusion feature training, and the reference features comprise reference image features and reference voice features;
and obtaining the target fusion characteristics of the target to be detected according to the output of the preset fusion model.
2. The method according to claim 1, further comprising, prior to said inputting the target feature into a preset fusion model:
determining dimensions of the target feature;
if the dimension is different from a preset feature dimension input into the preset fusion model, performing dimension splitting on the target feature according to the preset feature dimension;
the inputting the target feature into a preset fusion model comprises:
and respectively inputting the target characteristics after dimension splitting into the preset fusion model.
3. The method according to claim 2, wherein the obtaining of the target fusion feature of the target to be detected according to the output of the preset fusion model comprises:
acquiring fusion characteristics corresponding to the dimension-split target characteristics output by the preset fusion model;
and carrying out dimension combination on the obtained fusion features to obtain the target fusion features.
4. The method according to claim 2, wherein the performing dimension splitting on the target feature according to the preset feature dimension comprises:
determining the number of rows and columns of the preset feature dimension;
splitting the line number of the target feature according to the line number of the preset feature dimension, so that the line number of the split target feature is equal to the line number of the preset feature dimension;
and/or splitting the column number of the target feature according to the column number of the preset feature dimension, so that the column number of the split target feature is equal to the column number of the preset feature dimension.
5. The method according to claim 1, further comprising, prior to said inputting the target feature into a preset fusion model:
inputting the reference features into the preset fusion model;
determining fusion accuracy according to the fusion features output by the preset fusion model and the reference fusion features;
if the fusion accuracy is lower than a preset accuracy threshold, adjusting the preset fusion model according to the fusion accuracy to improve the fusion accuracy, taking the adjusted preset fusion model as a new preset fusion model, and re-executing the step of inputting the reference features into the preset fusion model.
6. The method according to any one of claims 1 to 5, wherein the acquiring the target feature of the target to be detected comprises:
inputting a target image of the target to be detected into a first preset model, and inputting target voice of the target to be detected into a second preset model, wherein the first preset model is obtained by training with reference images and reference image features, and the second preset model is obtained by training with reference voices and reference voice features;
and acquiring the target image characteristics output by the first preset model and the target voice characteristics output by the second preset model.
7. The method of claim 6, wherein the target features further comprise target text features;
the method for acquiring the target characteristics of the target to be detected further comprises the following steps:
inputting a target text of the target to be detected into a third preset model, wherein the third preset model is obtained through training of reference texts and reference text characteristics;
and acquiring the target text features output by the third preset model.
8. The method according to any one of claims 1 to 5, characterized in that after obtaining the target fusion feature of the target to be detected according to the output of the preset fusion model, the method further comprises:
and detecting the target to be detected according to the target fusion characteristics.
9. The method according to claim 8, wherein the detecting the target to be detected according to the target fusion feature comprises:
inputting the target fusion features into a fourth preset model, wherein the fourth preset model is obtained by training with reference fusion features and reference states;
and acquiring the target state of the target to be detected output by the fourth preset model.
10. The method according to claim 6, before the inputting the target image of the target to be detected into the first preset model and the target voice of the target to be detected into the second preset model, further comprising:
and receiving the target image sent by a preset image acquisition device.
11. The method according to claim 6, before the inputting the target image of the target to be detected into the first preset model and the target voice of the target to be detected into the second preset model, further comprising:
and receiving the target image sent by the terminal equipment of the target to be detected.
12. The method according to claim 6, before the inputting the target image of the target to be detected into the first preset model and the target voice of the target to be detected into the second preset model, further comprising:
and receiving the target voice sent by a preset voice acquisition device.
13. The method according to claim 6, before the inputting the target image of the target to be detected into the first preset model and the target voice of the target to be detected into the second preset model, further comprising:
and receiving the target voice sent by the terminal equipment of the target to be detected.
14. The method according to claim 7, further comprising, before the inputting the target text of the target to be detected into a third preset model:
and receiving the voice of the target to be detected sent by a preset voice acquisition device, and converting the received voice into the target text.
15. The method according to claim 7, further comprising, before the inputting the target text of the target to be detected into a third preset model:
and receiving the voice of the target to be detected sent by the terminal equipment of the target to be detected, and converting the received voice into the target text.
16. The method according to claim 7, further comprising, before the inputting the target text of the target to be detected into a third preset model:
and receiving the target text sent by the terminal equipment of the target to be detected.
17. A feature fusion apparatus, comprising:
the characteristic acquisition module is used for acquiring target characteristics of a target to be detected, and the target characteristics comprise target image characteristics and target voice characteristics;
the feature input module is used for inputting the target features into a preset fusion model, wherein the preset fusion model is obtained through reference feature and reference fusion feature training, and the reference features comprise reference image features and reference voice features;
and the fusion characteristic obtaining module is used for obtaining the target fusion characteristics of the target to be detected according to the output of the preset fusion model.
18. A feature fusion device, comprising:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-16.
19. A computer-readable storage medium, characterized in that it stores a computer program that causes a server to execute the method of any of claims 1-16.
CN202011181418.9A 2020-10-29 2020-10-29 Feature fusion method, device and storage medium Pending CN112307948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011181418.9A CN112307948A (en) 2020-10-29 2020-10-29 Feature fusion method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011181418.9A CN112307948A (en) 2020-10-29 2020-10-29 Feature fusion method, device and storage medium

Publications (1)

Publication Number Publication Date
CN112307948A true CN112307948A (en) 2021-02-02

Family

ID=74331662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011181418.9A Pending CN112307948A (en) 2020-10-29 2020-10-29 Feature fusion method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112307948A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361462A (en) * 2021-06-30 2021-09-07 北京百度网讯科技有限公司 Method and device for video processing and caption detection model
CN113361462B (en) * 2021-06-30 2022-11-08 北京百度网讯科技有限公司 Method and device for video processing and caption detection model
CN114373448A (en) * 2022-03-22 2022-04-19 北京沃丰时代数据科技有限公司 Topic detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
KR102418446B1 (en) Picture-based vehicle damage assessment method and apparatus, and electronic device
KR20190060817A (en) Image based vehicle damage determination method and apparatus, and electronic device
JP2020517015A (en) Picture-based vehicle damage assessment method and apparatus, and electronic device
CN109740573B (en) Video analysis method, device, equipment and server
CN112307948A (en) Feature fusion method, device and storage medium
CN110084113B (en) Living body detection method, living body detection device, living body detection system, server and readable storage medium
CN110031697B (en) Method, device, system and computer readable medium for testing target identification equipment
CN112528940B (en) Training method, recognition method and device of driver behavior recognition model
CN107393308A (en) A kind of method, apparatus and managing system of car parking for identifying car plate
US11120308B2 (en) Vehicle damage detection method based on image analysis, electronic device and storage medium
JP6387838B2 (en) Traffic violation management system and traffic violation management method
CN109800684B (en) Method and device for determining object in video
CN110443221A (en) A kind of licence plate recognition method and system
CN110619692A (en) Accident scene restoration method, system and device
CN112507314B (en) Client identity verification method, device, electronic equipment and storage medium
CN112052780A (en) Face verification method, device and system and storage medium
CN111339949A (en) License plate recognition method and device and inspection vehicle
CN110807394A (en) Emotion recognition method, test driving experience evaluation method, device, equipment and medium
CN110110141B (en) Camera list sorting method and device and monitoring management platform
CN112560685A (en) Facial expression recognition method and device and storage medium
CN109698900B (en) Data processing method, device and monitoring system
CN111161743B (en) Cash receiving supervision method and system based on voice recognition
CN104067606A (en) Camera, camera system, and self-diagnosis method
CN113537087A (en) Intelligent traffic information processing method and device and server
CN111462480A (en) Traffic image evidence verification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination