LLM Agent Control of an Autonomous Robot
Generalized software for LLMs to control autonomous robots using LangChain and WebRTC-based real-time communication
Overview
This project explores using advanced Large Language Models (LLMs) to control a ROS2-based robot (A24). By integrating LangChain for text-based control and LiveKit WebRTC for speech-based interaction, we demonstrate how LLMs can bridge the gap between humans and robots. The project also leverages OpenAI’s API for real-time decision-making, enabling natural-language queries to be translated into robotic actions. Be sure to check out the associated publication on my website for this project.

Objectives
- Use LangChain to allow text-to-text control of the A24 robot.
- Use LiveKit WebRTC to enable speech-to-speech interaction with the robot.
- Enhance robot capabilities with ROS2, Nav2, and custom tools for robust navigation and control.
- Demonstrate tool chaining with LangChain for complex queries.
Key Features
LangChain Integration
- Added tools to LangChain, allowing the robot to:
  - Navigate to specific coordinates.
  - Capture and describe images using OpenAI’s API.
  - Determine distances from depth cameras.
  - Echo ROS2 topics for debugging and odometry readings.
- Tool chaining enables handling multi-step queries, e.g., determining the robot’s location and the current time (see the tool sketch below).
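A minimal sketch of how one such tool can be registered, assuming LangChain’s `@tool` decorator and a hypothetical `send_nav2_goal` helper (the model name is illustrative; the project’s actual tools live in agent_code/):

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

def send_nav2_goal(x: float, y: float) -> None:
    """Hypothetical helper: would publish a Nav2 NavigateToPose goal via rclpy."""
    ...

@tool
def navigate_to(x: float, y: float) -> str:
    """Drive the robot to the map coordinate (x, y) using Nav2."""
    send_nav2_goal(x, y)
    return f"Navigating to ({x}, {y})."

# Bind the tool so the LLM can decide when to invoke it.
llm = ChatOpenAI(model="gpt-4o").bind_tools([navigate_to])
```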
LiveKit WebRTC
- Developed a speech-to-speech interaction system using LiveKit’s real-time communication platform.
- Integrated a pipeline: Speech-to-Text (STT) → LLM Query → Text-to-Speech (TTS), sketched below.
- Allowed seamless interaction through a smartphone or other devices.
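A minimal sketch of this pipeline with LiveKit’s Agents framework, following the 0.x `VoicePipelineAgent` API (class and plugin names vary across livekit-agents versions, so treat this as illustrative):

```python
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import openai, silero

async def entrypoint(ctx: JobContext):
    # Join the LiveKit room and subscribe to the caller's audio only.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # STT -> LLM -> TTS, with voice-activity detection segmenting speech.
    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=openai.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
        chat_ctx=llm.ChatContext().append(
            role="system", text="You control the A24 robot."
        ),
    )
    agent.start(ctx.room)

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```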
ROS2 Integration
- Controlled robot navigation using the Nav2 stack and SLAM Toolbox (see the goal-sending sketch below).
- Used a Gazebo simulation to validate the system’s performance.
- Enhanced localization with Adaptive Monte Carlo Localization (AMCL).
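For reference, sending a single navigation goal from Python can look like the following sketch, using the `nav2_simple_commander` package that ships with Nav2 (coordinates are illustrative):

```python
import rclpy
from geometry_msgs.msg import PoseStamped
from nav2_simple_commander.robot_navigator import BasicNavigator, TaskResult

rclpy.init()
navigator = BasicNavigator()
navigator.waitUntilNav2Active()  # blocks until AMCL and Nav2 are up

goal = PoseStamped()
goal.header.frame_id = 'map'
goal.header.stamp = navigator.get_clock().now().to_msg()
goal.pose.position.x = 2.0
goal.pose.position.y = 1.0
goal.pose.orientation.w = 1.0  # face forward

navigator.goToPose(goal)
while not navigator.isTaskComplete():
    pass  # navigator.getFeedback() could be surfaced to the LLM here

if navigator.getResult() == TaskResult.SUCCEEDED:
    print('Reached goal')
```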
Simulation Setup
- Simulated the A24 robot in a Gazebo environment representing a home.
- Generated maps using slam_toolbox and navigated autonomously within the environment (a minimal launch sketch follows).
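A minimal launch file for running slam_toolbox against the simulation might look like this sketch (the project’s actual launch files live under launch/):

```python
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        Node(
            package='slam_toolbox',
            executable='async_slam_toolbox_node',
            name='slam_toolbox',
            output='screen',
            # Use Gazebo's simulated clock so scan timestamps line up.
            parameters=[{'use_sim_time': True}],
        ),
    ])
```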

Results
LangChain
- Successfully executed commands such as:
  - “Move forward.”
  - “Where are you?” (using the /odom topic for coordinates).
- Demonstrated robust tool chaining to solve multi-step queries (a minimal /odom reader is sketched below).
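A location query like this can be backed by a tool that grabs a single message from /odom. A minimal sketch, assuming the standard nav_msgs/Odometry message (the tool name is illustrative):

```python
import rclpy
from rclpy.node import Node
from nav_msgs.msg import Odometry

class OdomOnce(Node):
    """Subscribes to /odom and stores the first pose received."""
    def __init__(self):
        super().__init__('odom_once')
        self.pose = None
        self.create_subscription(Odometry, '/odom', self._cb, 10)

    def _cb(self, msg: Odometry):
        self.pose = msg.pose.pose

def read_odom() -> str:
    """Return the robot's current (x, y) as a string the LLM can relay."""
    rclpy.init()
    node = OdomOnce()
    while rclpy.ok() and node.pose is None:
        rclpy.spin_once(node, timeout_sec=0.1)
    p = node.pose.position
    node.destroy_node()
    rclpy.shutdown()
    return f"Robot is at x={p.x:.2f}, y={p.y:.2f}"
```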
LiveKit
- Enabled speech-to-speech queries such as:
  - “Make the robot move forward.”
  - “Stop the robot.”
- Highlighted limitations in tool chaining compared to LangChain.
ROS2 & Navigation
- Generated accurate maps using SLAM Toolbox and navigated autonomously.
- Tested navigation with waypoints, achieving high localization accuracy (see the waypoint sketch below).
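Waypoint runs of this kind can be scripted with `nav2_simple_commander`’s `followWaypoints`; a sketch with illustrative map-frame coordinates:

```python
import rclpy
from geometry_msgs.msg import PoseStamped
from nav2_simple_commander.robot_navigator import BasicNavigator

def make_pose(nav: BasicNavigator, x: float, y: float) -> PoseStamped:
    pose = PoseStamped()
    pose.header.frame_id = 'map'
    pose.header.stamp = nav.get_clock().now().to_msg()
    pose.pose.position.x = x
    pose.pose.position.y = y
    pose.pose.orientation.w = 1.0
    return pose

rclpy.init()
nav = BasicNavigator()
nav.waitUntilNav2Active()
nav.followWaypoints([make_pose(nav, x, y) for x, y in [(1.0, 0.5), (2.5, 1.0)]])
while not nav.isTaskComplete():
    pass
```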

Tools and Libraries Used
- LangChain: LLM framework for integrating tools.
- LiveKit WebRTC: Real-time speech-to-speech interaction.
- OpenAI API: For natural language processing and image descriptions.
- ROS2: Robot Operating System for middleware and control.
- Nav2: Navigation stack for autonomous movement.
- slam_toolbox: For mapping and localization.
- Gazebo: Robot simulation environment.
Repository Structure
- agent_code/:
  - Contains the LangChain and LiveKit agents.
  - Includes tools for image recognition, navigation, and querying ROS2 topics.
- launch/:
  - Launch files for AMCL, SLAM, and navigation.
- maps/:
  - Pre-generated maps for testing navigation.
- urdf/:
  - Unified Robot Description Format (URDF) files for the A24 robot.
Challenges and Learnings
- Tool Chaining: LangChain demonstrated stronger tool-chaining capabilities than LiveKit, allowing multi-step task execution.
- Sim2Real: Testing in Gazebo provided insights into real-world deployment challenges, especially in localization and navigation.

Future Work
- Integrate the system with a physical A24 robot for real-world testing.
- Expand the toolset for LangChain and LiveKit to handle more complex tasks.
- Improve LiveKit’s ability to chain tools for multi-step queries.
Conclusion
This project showcases the power of LLMs in human-robot interaction. By combining LangChain’s tool integration with LiveKit’s real-time communication capabilities, we demonstrate a scalable, low-computation solution for controlling autonomous robots using natural language.