Researchers are increasingly focused on building systems for multi-modal data exploration, which combines structured and unstructured data. This involves analyzing text, images, videos, and databases to answer complex queries. These capabilities are crucial in healthcare, where medical professionals work with patient records, medical imaging, and textual reports. Similarly, in art curation and research, multi-modal exploration helps interpret databases that combine metadata, textual critiques, and artwork images. Seamlessly combining these data types offers significant potential for decision-making and insights.
One of the main challenges in this field is enabling users to query multi-modal data using natural language. Traditional systems struggle to interpret complex queries that involve multiple data formats, such as asking for trends in structured tables while analyzing related image content. Moreover, the absence of tools that provide clear explanations for query outcomes makes it difficult for users to trust and validate the results. These limitations create a gap between advanced data processing capabilities and real-world usability.
Current solutions attempt to address these challenges using two main approaches. The first integrates multiple modalities into unified query languages, such as NeuralSQL, which embeds vision-language functions directly into SQL commands. The second uses agentic workflows that coordinate various tools for analyzing specific modalities, exemplified by CAESURA. While these approaches have advanced the field, they fall short in optimizing task execution, ensuring explainability, and addressing complex queries efficiently. These shortcomings highlight the need for a system capable of dynamic adaptation and clear reasoning.
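To make the first approach concrete, the sketch below registers an ordinary Python function as a SQL function in SQLite and calls it from a WHERE clause. It only illustrates the general idea of putting an image-understanding call inside a SQL query; it is not NeuralSQL's actual syntax. The function name `IMAGE_DEPICTS`, the `paintings` table, and the lookup table standing in for a vision-language model are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical stand-in for a vision-language model: canned answers per image.
FAKE_VLM_ANSWERS = {
    "img/001.png": "yes",
    "img/002.png": "no",
    "img/003.png": "yes",
}

def image_depicts(image_path: str, concept: str) -> str:
    """Pretend to ask a vision model whether the image shows `concept`."""
    # A real system would run a model here; this stub ignores `concept`.
    return FAKE_VLM_ANSWERS.get(image_path, "unknown")

conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under an illustrative name.
conn.create_function("IMAGE_DEPICTS", 2, image_depicts)
conn.executescript("""
CREATE TABLE paintings (title TEXT, century INTEGER, img_path TEXT);
INSERT INTO paintings VALUES
  ('A', 17, 'img/001.png'),
  ('B', 17, 'img/002.png'),
  ('C', 18, 'img/003.png');
""")

# One query mixes a structured aggregate with an image-level predicate.
rows = conn.execute(
    "SELECT century, COUNT(*) FROM paintings "
    "WHERE IMAGE_DEPICTS(img_path, 'sword') = 'yes' "
    "GROUP BY century ORDER BY century"
).fetchall()
print(rows)  # [(17, 1), (18, 1)]
```

The appeal of this style is that a single statement expresses the whole question; the drawback noted above is that the query engine has little room to plan or explain the expensive model calls.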
Researchers at Zurich University of Applied Sciences have introduced XMODE, a novel system designed to address these issues. XMODE enables explainable multi-modal data exploration using a Large Language Model (LLM)-based agentic framework. The system interprets user queries and decomposes them into subtasks like SQL generation and image analysis. By creating workflows represented as Directed Acyclic Graphs (DAGs), XMODE optimizes the sequence and execution of tasks. This approach improves efficiency and accuracy compared to state-of-the-art systems like CAESURA and NeuralSQL. Moreover, XMODE supports task re-planning, enabling it to adapt when specific components fail.
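The DAG idea can be sketched with Python's standard `graphlib` module. The plan below is a hypothetical decomposition of a question such as "plot the number of paintings depicting a sword per century"; the task names and dependencies are assumptions for illustration and are not taken from XMODE's implementation. The sketch groups tasks into batches whose members have no unmet dependencies and could therefore run at the same time.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "sql_images": set(),                        # fetch image paths
    "sql_metadata": set(),                      # fetch century per painting
    "image_analysis": {"sql_images"},           # ask a vision model about each image
    "join": {"image_analysis", "sql_metadata"}, # combine both result sets
    "plot": {"join"},                           # render the final chart
}

def parallel_batches(plan):
    """Group tasks into batches that can run concurrently."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch as finished
    return batches

batches = parallel_batches(plan)
print(batches)
# [['sql_images', 'sql_metadata'], ['image_analysis'], ['join'], ['plot']]
```

In this toy plan, the two SQL tasks share the first batch because neither depends on the other; everything downstream has to wait for its inputs.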
The architecture of XMODE includes five key components: planning and expert model allocation, execution and self-debugging, decision-making, expert tools, and a shared data repository. When a query is received, the system constructs a detailed workflow of tasks, assigning them to appropriate tools like SQL generation modules and image analysis models. These tasks are executed in parallel wherever possible, reducing latency and computational costs. Further, XMODE’s self-debugging capabilities allow it to identify and rectify errors in task execution, ensuring reliability. This adaptability is critical for handling complex workflows that involve diverse data modalities.
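A minimal sketch of parallel execution with error recovery might look like the following. It assumes a plain retry loop as a stand-in for self-debugging (the actual mechanism in XMODE is presumably richer) and a dictionary as the shared data repository. The task functions, the simulated failure, and the `shared` layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def sql_task(shared, attempt):
    # Hypothetical SQL step: returns (image path, century) rows.
    return [("img/001.png", 17), ("img/002.png", 18)]

def image_task(shared, attempt):
    # Hypothetical image step that fails once to simulate a transient error.
    if attempt == 1:
        raise ValueError("model timeout")
    return {"img/001.png": "yes", "img/002.png": "no"}

def run_with_retry(task, shared, max_attempts=2):
    """Run a task, logging each failure and retrying up to max_attempts."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return task(shared, attempt), errors
        except Exception as exc:
            errors.append(f"{task.__name__} attempt {attempt}: {exc}")
    raise RuntimeError(f"{task.__name__} failed after {max_attempts} attempts")

def run_batch(tasks, shared):
    """Run independent tasks in parallel, then store results in `shared`."""
    with ThreadPoolExecutor() as pool:
        futures = {t.__name__: pool.submit(run_with_retry, t, shared) for t in tasks}
    # Leaving the `with` block waits for every future to finish.
    for name, future in futures.items():
        result, errors = future.result()
        shared["results"][name] = result
        shared["errors"].extend(errors)

shared = {"results": {}, "errors": []}
run_batch([sql_task, image_task], shared)
print(shared["errors"])  # ['image_task attempt 1: model timeout']
```

Keeping the error log next to the results means a later step, or the user, can see which tasks needed a second attempt, which is one simple way to support explanations of how an answer was produced.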
XMODE demonstrated superior performance during testing on two datasets. On an artwork dataset, XMODE achieved 63.33% accuracy overall, compared to CAESURA’s 33.33%. It excelled in handling tasks requiring complex outputs, such as plots and combined data structures, achieving 100% accuracy in generating plot-plot and plot-data structure outputs. Also, XMODE’s ability to execute tasks in parallel reduced latency to 3,040 milliseconds, compared to CAESURA’s 5,821 milliseconds. These results highlight its efficiency in processing natural language queries over multi-modal datasets.
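For a sense of scale, the reported artwork figures can be turned into relative terms. The snippet below uses only the numbers quoted above; the derived values are simple arithmetic and are not figures reported by the researchers.

```python
# Figures reported for the artwork dataset.
xmode_acc, caesura_acc = 63.33, 33.33  # overall accuracy, percent
xmode_ms, caesura_ms = 3040, 5821      # latency, milliseconds

# Absolute accuracy gap in percentage points.
acc_gain_points = round(xmode_acc - caesura_acc, 2)

# Latency reduction relative to CAESURA, in percent.
latency_cut_pct = round(100 * (caesura_ms - xmode_ms) / caesura_ms, 1)

# How many times faster XMODE is than CAESURA.
speedup = round(caesura_ms / xmode_ms, 2)

print(acc_gain_points, latency_cut_pct, speedup)  # 30.0 47.8 1.91
```

In other words, the accuracy gap is 30 percentage points, and the latency figure corresponds to roughly a 48% reduction, or about 1.9 times faster.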
On the electronic health records (EHR) dataset, XMODE achieved 51% accuracy. It outperformed NeuralSQL on multi-table queries, scoring 77.50% compared to NeuralSQL’s 47.50%, and on binary queries, where it reached 74% accuracy against NeuralSQL’s 48%. XMODE’s capability to adapt and re-plan tasks contributed to this robust performance, making it particularly effective in scenarios requiring detailed reasoning and cross-modal integration.
XMODE addresses the limitations of existing multi-modal data exploration systems by combining advanced planning, parallel task execution, and dynamic re-planning. This approach lets users query complex datasets efficiently while keeping results transparent and explainable. With demonstrated improvements in accuracy, efficiency, and cost-effectiveness, XMODE represents a significant advancement in the field, offering practical applications in areas such as healthcare and art curation.
Check out the Paper. All credit for this research goes to the researchers of this project.
The post This AI Paper Introduces XMODE: An Explainable Multi-Modal Data Exploration System Powered by LLMs for Enhanced Accuracy and Efficiency appeared first on MarkTechPost.