Voice-Controlled Drone Using a Fine-Tuned Large Language Model (LLM)

System Demonstration

Drone in Action (Gazebo)

Drone Running via Python Script

Figure 1: End-to-End LLM-Based Voice-Controlled Drone Pipeline

Overview

This project presents a voice-controlled unmanned aerial vehicle (UAV) system powered by a fine-tuned Large Language Model (LLM) that enables robust, natural-language-based drone control in complex real-world environments. Instead of relying on rigid command formats, the system interprets human intent from diverse, noisy, and ambiguous voice commands and converts them into structured drone actions.

The proposed framework is designed to be application-agnostic, making it suitable for use in precision agriculture, industrial inspection sites, construction and civil engineering projects, surveying and mapping operations, and search-and-rescue or disaster-response scenarios.

Motivation

Traditional drone control systems depend on either manual joystick input or strictly formatted voice commands. In real operational environments, however, operators' hands are often occupied, situations demand fast and intuitive interaction, phrasing varies significantly between users, and background noise is common.

For example, the same drone action may be expressed as “Take off to 5 meters”, “Go up 5 meters”, or “Launch and climb 5 m”. Conventional NLP pipelines struggle to handle such linguistic variability.
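
To make the target of this mapping concrete, the sketch below shows a hypothetical structured form that all three phrasings should resolve to. The schema is illustrative only; the project's actual dataset format appears in Figure 2.

```python
# Hypothetical structured target; the project's real schema appears in Figure 2.
COMMAND = {"action": "TAKEOFF", "params": {"altitude_m": 5.0}}

# Each of these surface forms should map to COMMAND:
UTTERANCES = [
    "Take off to 5 meters",
    "Go up 5 meters",
    "Launch and climb 5 m",
]
```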

Objectives

The main objectives of this work are to (1) fine-tune a compact LLM that maps diverse natural-language voice commands to structured drone actions, (2) build an instruction-style dataset covering the core drone command set, (3) keep the pipeline deployable within consumer-grade GPU memory limits, and (4) validate the end-to-end system in simulation.

Challenges in Real-World Drone Operation

Drones are increasingly used in environments where operators are walking, climbing, or inspecting structures, where visual attention is divided, and where immediate decision-making is required.

These conditions expose the limitations of rule-based or traditional NLP-based voice control systems.

Proposed Solution: LLM-Based Voice-Controlled Drone System

This work proposes the use of a fine-tuned LLM to directly model the semantic intent behind voice commands, rather than relying on surface-level keyword matching.

Dataset Preparation (LLM-Assisted)

The dataset covers seven fundamental drone actions: ARM, DISARM, TAKEOFF, LAND, MOVE, RTL, and STOP.

Figure 2: Dataset Format and Instruction-Style Labeling

The dataset is generated using base command templates, LLM-based paraphrasing with DistilGPT-2, parameter randomization, and instruction-style JSON labeling.
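
A minimal sketch of this generation loop is shown below, assuming Hugging Face transformers for DistilGPT-2 paraphrasing. The templates, parameter ranges, and prompting strategy are illustrative assumptions, not the project's actual ones.

```python
import json
import random

from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="distilgpt2")
set_seed(42)

# Illustrative templates; the project's actual templates are not listed here.
TEMPLATES = {
    "TAKEOFF": "Take off to {alt} meters",
    "MOVE": "Move {dist} meters {direction}",
}

def make_samples(action: str, n_variants: int = 3) -> list[dict]:
    """Fill one template with random parameters, then ask DistilGPT-2
    for surface-form variants of the filled command."""
    params = {
        "alt": random.randint(2, 20),
        "dist": random.randint(1, 50),
        "direction": random.choice(["forward", "backward", "left", "right"]),
    }
    command = TEMPLATES[action].format(**params)
    prompt = f"Paraphrase the drone command: {command}\nParaphrase:"
    outputs = generator(prompt, max_new_tokens=16,
                        num_return_sequences=n_variants, do_sample=True)
    # DistilGPT-2 is a small base model, so raw variants usually need filtering
    # before being kept as training examples.
    return [
        {"instruction": out["generated_text"][len(prompt):].strip(),
         "output": {"action": action, "params": params}}
        for out in outputs
    ]

print(json.dumps(make_samples("TAKEOFF"), indent=2))
```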

LLM Fine-Tuning Pipeline

The base model is Phi-3 Mini (3.8B parameters), selected for its strong instruction-following capability and its ability to run within an 8 GB GPU memory budget.

The model was fine-tuned with supervised fine-tuning (SFT) using LoRA adapters under 4-bit quantization, with training settings chosen to prevent overfitting.
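
The sketch below shows one way to set up this configuration with transformers, bitsandbytes, and peft. The LoRA rank, alpha, and target modules are assumptions for illustration, not the project's reported settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"

# 4-bit NF4 quantization keeps the 3.8B base model within an 8 GB GPU budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA settings are illustrative; Phi-3 uses fused qkv_proj/o_proj attention
# projections, but the project's exact hyperparameters are not given here.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# SFT then proceeds on the instruction-style JSON dataset,
# e.g. with transformers.Trainer or trl's SFTTrainer.
```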

Model Evaluation

Figure 3: Confusion Matrix of Drone Command Classification

The model achieved an overall accuracy of 68.3% across the seven command classes. Performance was strongest for LAND, MOVE, and TAKEOFF, while the safety-critical STOP command was frequently confused with MOVE.
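
The per-class breakdown of Figure 3 can be reproduced with scikit-learn. The labels below are toy values for illustration; y_true and y_pred stand for the gold and predicted action labels on a held-out test split.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             ConfusionMatrixDisplay)

ACTIONS = ["ARM", "DISARM", "TAKEOFF", "LAND", "MOVE", "RTL", "STOP"]

# Toy labels for illustration; in practice these come from the test split.
y_true = ["TAKEOFF", "STOP", "MOVE", "LAND", "STOP"]
y_pred = ["TAKEOFF", "MOVE", "MOVE", "LAND", "STOP"]

print(f"accuracy: {accuracy_score(y_true, y_pred):.1%}")
cm = confusion_matrix(y_true, y_pred, labels=ACTIONS)
ConfusionMatrixDisplay(cm, display_labels=ACTIONS).plot(xticks_rotation=45)
plt.show()
```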

System Testing and Deployment

The system was validated in a Gazebo simulation environment, confirming voice-to-command translation, structured command generation, and Python-based control integration.
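
As a sketch of the Python-based control integration, the snippet below dispatches the LLM's structured output over MAVLink to a SITL endpoint using pymavlink. The connection string and the choice of pymavlink itself are assumptions; the project's actual control library is not named here.

```python
import json
from pymavlink import mavutil

# Assumption: the Gazebo SITL stack exposes a MAVLink endpoint on this UDP port.
master = mavutil.mavlink_connection("udp:127.0.0.1:14550")
master.wait_heartbeat()

def execute(command_json: str) -> None:
    """Dispatch a structured command emitted by the fine-tuned LLM."""
    cmd = json.loads(command_json)
    action = cmd["action"]
    if action == "ARM":
        master.arducopter_arm()
    elif action == "DISARM":
        master.arducopter_disarm()
    elif action == "TAKEOFF":
        master.mav.command_long_send(
            master.target_system, master.target_component,
            mavutil.mavlink.MAV_CMD_NAV_TAKEOFF,
            0, 0, 0, 0, 0, 0, 0, cmd["params"]["altitude_m"])
    elif action == "LAND":
        master.mav.command_long_send(
            master.target_system, master.target_component,
            mavutil.mavlink.MAV_CMD_NAV_LAND, 0, 0, 0, 0, 0, 0, 0, 0)
    elif action == "RTL":
        master.mav.command_long_send(
            master.target_system, master.target_component,
            mavutil.mavlink.MAV_CMD_NAV_RETURN_TO_LAUNCH,
            0, 0, 0, 0, 0, 0, 0, 0)
    # MOVE and STOP need velocity/position setpoints
    # (SET_POSITION_TARGET_LOCAL_NED) and are omitted in this sketch.

execute('{"action": "TAKEOFF", "params": {"altitude_m": 5.0}}')
```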

Real-drone deployment was not performed due to flight controller constraints; however, the architecture is fully transferable to physical UAV platforms.

Significance and Future Work

This work demonstrates the feasibility of LLM-driven voice control for UAVs operating in complex, real-world environments. Future work includes improving recognition of safety-critical commands such as STOP, which the current model confuses with MOVE, and deploying the pipeline on a physical UAV platform.