Autonomous Target Navigation with Object Delivery to a Human

Maryam Harakat (359826), Mohamed Taha Guelzim (355739), Muhammad Mikail Rais (402800)

COM-304 Final Project Report

Project Demonstration Video 1

Project Demonstration Video 2

Robot stopped in front of the target ball.

Abstract

This project implements a real-time ball detection and tracking system for a mobile robot using the OAK-D Lite stereo camera and the ROS 2 Humble middleware. The system leverages on-device YOLO inference provided by DepthAI, stereo depth data, and odometry feedback to estimate the relative position of a ball in 2D space. The main outputs of the system are a boolean flag indicating detection and a position vector consisting of the estimated angle and distance to the ball. This data can be used for downstream navigation or manipulation tasks.

Introduction

A. What is the problem we want to solve?

This project aims to develop an autonomous robot capable of navigating through obstacle-filled environments to locate, push, and deliver a predefined object to a moving human recipient. Unlike traditional navigation tasks, this requires dynamic target tracking and human interaction. The key challenge is designing a solution that doesn’t require specialized hardware, such as robotic arms or custom pushing mechanisms. This raises important research questions:

  • How does a robot perceive its environment?
  • How can a robot effectively navigate through and manipulate objects in such an environment without specialized hardware?
  • How can it track and follow a moving human recipient in real-time while ensuring accurate delivery?

B. Why is it important that this problem be solved?

Solving this problem is vital for advancing autonomous robots in human-centric applications, such as assisting elderly or disabled individuals, automating package delivery, and streamlining warehouse logistics. By eliminating the need for hardware modifications, our solution enables easy deployment on existing platforms, making it adaptable and scalable. This research addresses the broader question:

  • How can we make autonomous robots more versatile and efficient without requiring hardware changes, ensuring seamless integration into real-world environments?

C. Problem Reiteration and Primary Objectives

Ball detection and tracking is a fundamental problem in mobile robotics, particularly in applications involving object following, pick-and-place, or navigation toward targets. This project integrates real-time object detection, depth perception, and motion compensation to track a red sports ball using the OAK-D Lite camera and a TurtleBot 4 Lite running ROS 2 Humble. Hence, the primary objectives of this project were:

  • Detect a ball using YOLO on the OAK-D Lite VPU.
  • Estimate the ball’s 2D relative position using stereo depth.
  • Integrate odometry to compensate for processing delay.
  • Publish clean outputs suitable for robot navigation logic.

Related Work

A. Luxonis DepthAI Framework

DepthAI is an embedded AI vision platform enabling stereo depth and neural inference on-device via the Myriad X processor. Our system uses DepthAI’s YoloSpatialDetectionNetwork to perform real-time object detection and depth estimation on-device.

B. Visual Servoing and Object Following with ROS

The objective of that line of work, detecting and tracking objects in real time with a mobile robot, is similar to our project goal, although it relies on traditional vision methods (color segmentation, optical flow) and CPU-based processing. We improved robustness and speed by using stereo depth sensing and on-device YOLO inference, leading to more accurate and lower-latency tracking.

C. VSLAM and Odometry Integration

Mur-Artal et al. (2015) presented ORB-SLAM, which builds an accurate spatial understanding of the environment over time from visual data; our system similarly combines visual estimates with odometry feedback.

Method

A. Space Exploration and Mapping

To enable autonomous space exploration, our system integrates Simultaneous Localization and Mapping (SLAM) with custom frontier detection and navigation strategies. SLAM is a core technique that allows the robot to construct a map of an unknown environment while concurrently estimating its own position within that map. In our implementation, we utilized a SLAM algorithm in conjunction with ROS2 and RViz, enabling the construction and real-time visualization of a 2D occupancy grid. This grid map is generated using LiDAR data and categorizes the environment into free space, obstacles, and unknown regions. The exploration process begins with the initial mapping phase, where SLAM builds a foundational representation of the environment. A custom algorithm is then applied to detect frontiers—boundaries between known and unexplored areas of the map.

discoverer.py - convolute function
View on GitHub
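The convolute function linked above implements this frontier check. As a rough, hypothetical sketch (the grid values and 8-connected kernel below are assumptions, not our exact code), frontier cells of a nav_msgs/OccupancyGrid can be found with a single 2D convolution:

```python
import numpy as np
from scipy.signal import convolve2d

FREE, UNKNOWN = 0, -1  # standard nav_msgs/OccupancyGrid conventions

def find_frontier_cells(grid: np.ndarray) -> np.ndarray:
    """Boolean mask of frontier cells: free cells touching unknown space."""
    unknown = (grid == UNKNOWN).astype(np.uint8)
    kernel = np.ones((3, 3), dtype=np.uint8)  # 8-connected neighbourhood
    unknown_neighbours = convolve2d(unknown, kernel, mode="same",
                                    boundary="fill", fillvalue=0)
    return (grid == FREE) & (unknown_neighbours > 0)
```

Connected frontier cells can then be clustered and ranked by size, matching the selection criterion described in the next paragraph.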

Once identified, frontiers are evaluated based on specific selection criteria. The size of each frontier is assessed, with preference given to larger frontiers that potentially open access to broader unexplored areas. After a frontier is selected, the robot navigates to it using the Nav2 navigation stack. Navigation is supported by a costmap, which incorporates obstacle data to ensure safe and collision-free movement.

discoverer.py - goal_msg construction
View on GitHub
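For context, a minimal sketch of how such a NavigateToPose goal can be constructed and sent with rclpy is shown below; it follows standard Nav2 conventions rather than reproducing our exact goal_msg construction:

```python
from geometry_msgs.msg import PoseStamped
from nav2_msgs.action import NavigateToPose
from rclpy.action import ActionClient

def send_frontier_goal(node, x: float, y: float):
    """Ask Nav2 to drive to a frontier point (x, y) in the map frame."""
    client = ActionClient(node, NavigateToPose, 'navigate_to_pose')
    client.wait_for_server()

    goal_msg = NavigateToPose.Goal()
    goal_msg.pose = PoseStamped()
    goal_msg.pose.header.frame_id = 'map'
    goal_msg.pose.header.stamp = node.get_clock().now().to_msg()
    goal_msg.pose.pose.position.x = x
    goal_msg.pose.pose.position.y = y
    goal_msg.pose.pose.orientation.w = 1.0  # keep an arbitrary fixed heading

    return client.send_goal_async(goal_msg)  # future exposing feedback and result
```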

To aid debugging and enhance situational awareness, we created a secondary map that visualizes waypoints and distinguishes between accessible and inaccessible points. This secondary map provides critical insight into the robot’s decision-making process during exploration and navigation.

Waypoints Map
Secondary map showing waypoints (dots), accessibility (red/green), current map (gray), frontiers (yellow) and target (blue star).

B. Hardware-Native Object Detection

A notable achievement in our project is that we managed to run YOLO inference natively on the TurtleBot’s camera. This enabled the robot to do real-time object detection while simultaneously running SLAM to map the environment. First, we installed the DepthAI library on the TurtleBot’s Raspberry Pi. DepthAI is an open-source framework developed by Luxonis that provides a way to interface with DepthAI hardware, including the OAK-D Lite camera on the TurtleBot. Because DepthAI devices use Intel’s Myriad X VPU, they only run models compiled into a special binary format called .blob, short for binary large object; in the DepthAI world, this is a compiled model ready to run on-device. It is generated from standard model formats like ONNX, TensorFlow, or OpenVINO IR using a tool called blobconverter. Luxonis maintains an open model zoo that contains various object detection models, among them YOLOv8n and YOLOv7. We compared the performance and reliability of both models and decided to use YOLOv8n.
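As an illustration of the conversion step, a blob can be fetched or compiled with the blobconverter Python package roughly as follows; the model identifier and shave count are assumptions that depend on the zoo version used:

```python
import blobconverter

# Download a YOLOv8n blob compiled for the Myriad X VPU.
# The model name is illustrative; check the Luxonis model zoo for the exact identifier.
blob_path = blobconverter.from_zoo(
    name="yolov8n_coco_640x352",
    zoo_type="depthai",
    shaves=6,
)
print(f"Compiled blob stored at: {blob_path}")
```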

DepthAI software showing the YOLO model running on the camera, along with stereo depth estimation.

The DepthAI framework provides a convenient Python API to interact with the camera and run inference. We used the YoloSpatialDetectionNetwork class to create a pipeline that captures images, runs YOLO inference, and outputs detection results with spatial coordinates. The pipeline also provides depth information for each detected object, allowing us to estimate the 2D position of the ball in the robot’s coordinate frame.

simplified depth.py - BallDetector class
View on GitHub
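For reference, the core of such a DepthAI pipeline looks roughly like the sketch below; the blob path, input size, and thresholds are placeholders rather than our exact configuration:

```python
import depthai as dai

pipeline = dai.Pipeline()

# RGB preview feeding the neural network
cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(416, 416)
cam.setInterleaved(False)

# Stereo pair for depth
mono_left = pipeline.create(dai.node.MonoCamera)
mono_right = pipeline.create(dai.node.MonoCamera)
mono_left.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
mono_right.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
mono_left.setBoardSocket(dai.CameraBoardSocket.LEFT)
mono_right.setBoardSocket(dai.CameraBoardSocket.RIGHT)
stereo = pipeline.create(dai.node.StereoDepth)
mono_left.out.link(stereo.left)
mono_right.out.link(stereo.right)

# YOLO detection + spatial localisation, all on the Myriad X VPU
nn = pipeline.create(dai.node.YoloSpatialDetectionNetwork)
nn.setBlobPath("yolov8n.blob")        # placeholder path to the compiled blob
nn.setConfidenceThreshold(0.5)
nn.setNumClasses(80)
nn.setCoordinateSize(4)
nn.setIouThreshold(0.5)
cam.preview.link(nn.input)
stereo.depth.link(nn.inputDepth)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("detections")
nn.out.link(xout.input)

with dai.Device(pipeline) as device:
    queue = device.getOutputQueue("detections", maxSize=4, blocking=False)
    while True:
        for det in queue.get().detections:
            # spatialCoordinates are in millimetres, in the camera frame
            print(det.label, det.spatialCoordinates.x, det.spatialCoordinates.z)
```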

The ball detection module runs continuously, capturing images and processing them in real-time.

C. Navigation to Detected Ball

1. Locating Ball Target: During navigation, the robot captures images at regular intervals as it moves through the environment. Once the robot reaches a selected frontier point (as given by the Nav2 navigation stack feedback), it performs a 360-degree rotation to further enhance its ability to detect any balls in its surroundings. This rotation ensures that the robot scans the entire environment, maximizing the likelihood of spotting a ball from any angle.


2. Acquiring Ball Target: Upon detecting the ball, we infer its position in the map. In the previous code snippet, we saw how we could get the angle of the ball’s center relative to the camera frame and the distance from the robot to the ball. Additionally, odometry correction is applied to account for the robot’s movement and ensure accurate positioning.

simplified depth.py - Ball position calculation
View on GitHub
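A simplified, illustrative version of this calculation is shown below; the sign conventions and helper signature are assumptions based on the DepthAI camera frame and the standard ROS map frame, not a verbatim copy of depth.py:

```python
import math

def ball_map_position(det_x_mm: float, det_z_mm: float,
                      robot_x: float, robot_y: float, robot_yaw: float):
    """Project a DepthAI spatial detection into map coordinates.

    det_x_mm: lateral offset of the ball in the camera frame (right positive), mm
    det_z_mm: forward distance of the ball in the camera frame, mm
    robot_*:  robot pose in the map frame at capture time (from odometry)
    """
    distance = math.hypot(det_x_mm, det_z_mm) / 1000.0  # metres
    bearing = math.atan2(-det_x_mm, det_z_mm)           # left of camera positive (ROS convention)

    ball_x = robot_x + distance * math.cos(robot_yaw + bearing)
    ball_y = robot_y + distance * math.sin(robot_yaw + bearing)
    return ball_x, ball_y, distance, bearing
```

The yaw used here is the odometry-corrected heading at the moment the frame was captured, which is what compensates for the processing delay mentioned earlier.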

Once the ball’s coordinates are established, the robot uses the Nav2 navigation stack again to plot a direct path to the ball’s location. This ensures that the robot efficiently moves toward the ball, adjusting its trajectory as necessary based on its real-time localization and the environment.

Experiments

Component Testing

As explained above, our project is composed of three main components: autonomous exploration of the environment, visual recognition of a colored ball, and navigation toward the detected ball. Each of these components addresses a fundamental robotics task, and each brings its own technical challenges. To develop a robust and reliable system, we adopted a modular and hypothesis-driven approach: we first implemented and tested each module independently, under controlled conditions, and then integrated them for full pipeline testing in diverse real-world environments.

We deliberately chose to test each component in isolation before integration for two main reasons. First, this allowed us to identify and debug issues specific to each sub-system without the confounding influence of the others. Second, we hypothesized that performance bottlenecks and failure cases would be easier to interpret when isolated, making debugging and improvements more efficient. This hypothesis was largely confirmed in practice: by testing components separately, we were able to refine frontier exploration strategies, improve visual recognition robustness, and tune navigation parameters without cascading failures.

Exploration Module

The goal of the exploration module was to enable the robot to autonomously build a map of an unknown environment using SLAM and then use frontier-based planning to explore the entire space. We hypothesized that frontier-based exploration would generalize well across a variety of environments with different spatial structures. To test this, we deployed the robot in a wide range of real-world settings, including large open areas (such as the basement of the CO building), small rooms, classrooms, amphitheaters, and living spaces. We evaluated the quality of the exploration by visualizing the SLAM-generated maps in RViz and comparing them to the actual layout of the environment.

RViz visualization of the SLAM map in an amphitheater.

Our criteria included completeness of the exploration, the relevance of frontier points, and the consistency of the map structure. In early trials, the robot explored small and medium-sized rooms reliably, but failed in large spaces like the CO building basement—remaining in a small area and re-scanning already visited zones. This refuted our initial hypothesis that the exploration algorithm would generalize to all indoor environments. Debugging revealed the issue: the waypoint list wasn’t updated correctly as the map expanded, causing the robot to revisit outdated frontiers. We fixed this by (1) dynamically maintaining the waypoint list to remove stale entries and add new ones as the map grew, and (2) scaling the waypoint map to cover the full known environment.

discoverer.py - Waypoint list maintenance
View on GitHub
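Conceptually, the fix amounts to re-validating every stored waypoint each time the map grows. The sketch below illustrates this with a hypothetical is_frontier helper; it is not the literal code linked above:

```python
def refresh_waypoints(waypoints, grid, resolution, origin):
    """Drop waypoints that are no longer frontiers and keep the rest.

    waypoints:  list of (x, y) map-frame candidates collected so far
    grid:       latest occupancy grid as a 2D numpy array
    resolution: metres per cell; origin: (x, y) of cell (0, 0) in the map frame
    """
    valid = []
    for (x, y) in waypoints:
        col = int((x - origin[0]) / resolution)
        row = int((y - origin[1]) / resolution)
        if 0 <= row < grid.shape[0] and 0 <= col < grid.shape[1]:
            # keep only points that are still free and still border unknown space
            if grid[row, col] == 0 and is_frontier(grid, row, col):  # is_frontier(): hypothetical helper
                valid.append((x, y))
    return valid
```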

After these changes, exploration succeeded even in large, complex areas, validating our original hypothesis—once the underlying bug was resolved.

Ball Detection Module

For the recognition module, we chose to detect the ball using YOLO as stated before. We hypothesized that this method would be sufficient for real-time detection in indoor settings, and that lighting variations or partial occlusions would be the primary challenges. We first tested ball detection with the robot’s camera detached and stationary. We placed the ball in various positions: in plain sight, partially hidden, far from the camera, and close to cluttered objects. These initial tests confirmed that our recognition pipeline worked reliably under controlled lighting.

However, when we reattached the camera to the moving robot and re-ran the same tests, we encountered new challenges. Motion blur and angle variations introduced noise into the detection pipeline. In the first implementation of the 360° image capture, there was not enough stabilization time: photographs were taken while the camera was still partially in motion, and the resulting frames exhibited pronounced blur. The practical consequence was a drop in ball-detection recall during early scans. We hypothesized that the blur was caused not by sensor noise or lighting but by insufficient settling time before each shot. To test this, we inserted a short “settle-and-shoot” pause: the robot halts for one second, leaving time for frame capture, and only then advances to the next yaw increment. Subsequent experiments confirmed the hypothesis, and we could then correctly compute the relative coordinates of the ball whenever it was detected.
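The resulting settle-and-shoot behaviour can be summarized by the following illustrative loop, where the step count, settle time, and the rotate_by and capture_frame helpers are hypothetical:

```python
import math
import time

def scan_surroundings(rotate_by, capture_frame, steps: int = 6, settle_s: float = 1.0):
    """Rotate in place and take one photo per yaw increment.

    rotate_by(angle_rad): hypothetical helper that turns the robot in place
    capture_frame():      hypothetical helper that grabs one camera frame
    """
    frames = []
    increment = 2.0 * math.pi / steps  # 60 degrees per step for 6 images
    for _ in range(steps):
        rotate_by(increment)           # stop after each rotation segment
        time.sleep(settle_s)           # let the camera settle to avoid motion blur
        frames.append(capture_frame())
    return frames
```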

Robot performing a 360° rotation to capture 6 images.

Navigation Module

The navigation component is responsible for planning and executing a path from the robot’s current position to the detected ball. Our hypothesis here was that once the ball is successfully localized in space, the existing path-planning algorithm (based on the robot’s SLAM map and obstacle avoidance) would be sufficient to approach the target reliably. To test this, we placed the ball in various positions within the robot’s field of view and initiated the navigation routine. The ball was placed in open areas, near walls, partially occluded, and among obstacles (but visible to the robot). The navigation system performed well when the path to the ball was relatively clear and the SLAM map accurately reflected the environment. However, in cluttered settings or when the ball was placed very close to obstacles, the robot occasionally failed to compute a viable path or hesitated in its approach. By default, the cost map inflates obstacles: it adds a safety buffer around them to prevent collisions.

RViz visualization of the cost map with inflated obstacles.

However, in our case, the inflation radius was too large: objects appeared significantly bigger than they actually were, making narrow passages or cluttered environments effectively impassable in the planner’s view. As a result, when the ball was placed near obstacles, the robot often struggled to find a viable path or took unnecessarily long detours. In practice, this made navigation toward the ball less reliable in constrained spaces. Fortunately, we identified this issue and were able to reduce the inflation radius and tune additional costmap parameters in the Nav2 configuration file.

nav2_params.yaml - Costmap parameters
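The relevant section of nav2_params.yaml is the costmap inflation layer; the snippet below shows the kind of parameters we tuned, with illustrative values rather than our exact configuration:

```yaml
local_costmap:
  local_costmap:
    ros__parameters:
      inflation_layer:
        plugin: "nav2_costmap_2d::InflationLayer"
        inflation_radius: 0.30      # smaller buffer so narrow passages stay passable
        cost_scaling_factor: 3.0    # how quickly cost decays away from obstacles
```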

Integration Testing

After thoroughly validating each component on its own, we moved on to integrated testing of the complete pipeline. These end-to-end tests were performed in a variety of real-world environments, such as classrooms, amphitheaters, and corridors in the CO building. The aim was to assess the overall coherence of the system—specifically, whether the transition between exploration, detection, and navigation occurred smoothly and whether the modules could function together without interference. These integrated experiments also allowed us to observe cumulative system behavior, such as total execution time, responsiveness, and robustness under varying spatial and visual conditions.

Limitations

Hardware Limitations

A significant limitation of our project was persistent hardware and connectivity problems. We lost valuable development time due to issues such as missing wheels, malfunctioning LiDAR and camera sensors, and frequent failures in maintaining a stable connection with the robot. These setbacks delayed progress and prevented us from dedicating more time to advanced functionalities we had initially planned. If the hardware had functioned correctly from the outset, we could have implemented more robust features and carried out more extensive testing.

Inaccurate Ball Detection

Our depth calculations were slightly inaccurate when images were captured while the robot was rotating. This was primarily due to motion blur and synchronization issues with odometry data. The robot's movement introduced inconsistencies between the captured image and the odometry feedback, leading to errors in the computed depth and position of the ball. To mitigate this, we should have averaged the computed positions from the last x images instead of relying solely on the first computed position. By aggregating data over multiple frames, we could have reduced noise and improved the reliability of the ball's estimated coordinates.
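A simple way to implement this would be a small rolling buffer of recent estimates, for example (an illustrative sketch, not part of our current code):

```python
from collections import deque

class BallPositionFilter:
    """Average the last few ball position estimates to suppress per-frame noise."""

    def __init__(self, window: int = 5):
        self._estimates = deque(maxlen=window)  # recent (x, y) map-frame estimates

    def update(self, x: float, y: float):
        """Add a new estimate and return the running average."""
        self._estimates.append((x, y))
        n = len(self._estimates)
        avg_x = sum(p[0] for p in self._estimates) / n
        avg_y = sum(p[1] for p in self._estimates) / n
        return avg_x, avg_y
```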

Visual Coverage Mapping

While the mapping algorithm performed well in terms of obstacle avoidance and constructing accurate maps, we did not observe substantial improvements in exploration behavior due to visual coverage mapping.

discoverer.py - Visual coverage mapping
View on GitHub
Visual coverage map in gray showing areas seen by the camera.
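The coverage map essentially marks, on a separate grid, every cell that falls inside the camera’s field of view when a picture is taken. The sketch below illustrates that marking step with assumed field-of-view and range values, ignoring occlusion for brevity:

```python
import math
import numpy as np

def mark_visual_coverage(coverage, robot_row, robot_col, robot_yaw,
                         fov_rad=math.radians(69), max_range_cells=60):
    """Mark grid cells inside the camera's field of view as 'seen'.

    coverage:      2D numpy array of 0 (unseen) / 1 (seen) cells
    robot_row/col: robot position in grid coordinates
    robot_yaw:     heading in radians (map frame)
    """
    rows, cols = coverage.shape
    for r in range(rows):
        for c in range(cols):
            dy, dx = r - robot_row, c - robot_col
            if math.hypot(dx, dy) > max_range_cells:
                continue
            angle = math.atan2(dy, dx)
            # wrap the angular difference to [-pi, pi] before comparing with half the FOV
            diff = (angle - robot_yaw + math.pi) % (2 * math.pi) - math.pi
            if abs(diff) <= fov_rad / 2:
                coverage[r, c] = 1
```

A full implementation would additionally ray-cast against obstacles so that walls block coverage.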

It seemed as though the algorithm was not fully leveraging the visual coverage data to optimize its exploration strategy. However, during our testing, the robot's 360-degree rotation behavior proved sufficient to cover the majority of the environment. This rotation allowed the robot to scan its surroundings effectively, ensuring that most areas were explored without requiring additional adjustments to the exploration algorithm.

Individual Contributions

Maryam

Maryam was primarily responsible for designing and implementing the core exploration algorithm. She developed the initial logic allowing the robot to autonomously explore unknown environments and identify frontier regions for navigation. She also implemented the 360° camera sweep mechanism at exploration start-up, enabling the robot to rotate on the spot and capture panoramic visual data. Furthermore, she integrated the logic to periodically capture images every 3 meters during exploration, laying the foundation for coupling vision with mapping.

Mohamed Taha

Mohamed Taha played a major role in debugging critical issues, including waypoint list maintenance, ensuring dynamic frontier updates as the SLAM map expanded, and resolving hardware connection problems between the robot and the controlling software stack. His contributions were essential to enabling robust and scalable exploration. He also worked on the development of the visual coverage map, a feature intended to track which areas had been seen by the camera, as distinct from those simply scanned by LiDAR.

Mikail

Mikail was in charge of the ball recognition pipeline using a YOLO-based object detection model. He handled the integration of the visual detector with the robot’s camera feed, and implemented the logic to extract 2D image coordinates of the ball, which could later be projected into real-world navigation targets. His work was crucial to enabling the robot to visually detect and locate its goal object.

Maryam and Mohamed Taha

Maryam and Mohamed Taha jointly conducted extensive testing across diverse real-world environments, including classrooms, amphitheaters, hallways, and basements. Their tests helped validate the system’s robustness and uncover limitations under varying spatial and visual conditions.

Extension

Delivering Objects

While we were unable to implement the extension of delivering objects to a human recipient due to time and hardware constraints, we identified a practical and simplified approach to achieve this functionality within the limits of our robot's capabilities. The delivery process would be based on straight-path planning and simple object pushing, with continuous feedback correction.

  1. Fixed Human Position: To mitigate the challenge of potential occlusion caused by the object during pushing, the recipient’s position would be assumed to be fixed. This avoids the need for continuous visual tracking, which may not be possible once the object blocks the camera's field of view.
  2. Straight Path Planning: The robot would compute the shortest, straight-line path from its current location to the fixed human position. This simplifies path planning and avoids the need for complex dynamic updates.
  3. Straight Pushing Behavior: Object delivery would rely on pushing the object directly along the planned path. During this phase, the robot would only move forward in a straight line while maintaining pressure against the object.
  4. LIDAR-Based Recalibration: The robot would use LIDAR data to monitor the alignment of the object during pushing. If the object begins to tilt to one side, the robot would slightly rotate in the opposite direction to re-center the object, ensuring more stable and accurate pushing (a sketch of this correction follows the list).
  5. Continuous Path Correction: Although the target path is initially computed as a straight line, minor adjustments would be made throughout the pushing process to correct for drift and maintain overall alignment with the target direction.
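
Since this extension was not implemented, the following is purely an illustrative sketch of the LIDAR-based recalibration in step 4, assuming a standard sensor_msgs/LaserScan whose zero angle points forward:

```python
import numpy as np

def pushing_correction(scan_ranges, angle_min, angle_increment,
                       window_rad=0.35, gain=0.8):
    """Return an angular velocity that re-centres the pushed object.

    scan_ranges:          list of ranges from a sensor_msgs/LaserScan
    angle_min/increment:  LaserScan angular metadata (radians)
    The object is assumed to be the closest return inside a narrow frontal
    window; if it drifts to one side, we rotate in the opposite direction.
    """
    ranges = np.asarray(scan_ranges)
    angles = angle_min + np.arange(len(ranges)) * angle_increment
    front = np.abs(angles) < window_rad                 # narrow cone in front of the robot
    valid = front & np.isfinite(ranges) & (ranges > 0.05)
    if not valid.any():
        return 0.0                                       # nothing in front: no correction
    object_angle = angles[valid][np.argmin(ranges[valid])]
    return -gain * object_angle                          # rotate opposite to the drift
```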

This approach offers a lightweight and achievable path toward object delivery with limited hardware. Future work could refine the LIDAR-based correction strategy, explore partial human tracking pre-push, and evaluate delivery reliability under varied terrain and object shapes.

Conclusion

A. Summary of Findings

In this project, we successfully developed a mobile robot capable of autonomously navigating through complex environments to locate a given target object. The system integrates real-time ball detection, depth estimation, and odometry correction to achieve accurate localization and navigation. Our findings indicate that the combination of YOLO-based object detection and stereo depth sensing provides a robust solution for real-time target tracking, enabling the robot to effectively navigate towards the detected ball.

B. Future Work

Future work should focus on enhancing the robustness of the navigation system, particularly in cluttered environments. This could involve refining the cost map parameters, improving the ball detection algorithm to handle occlusions and lighting variations, and integrating more advanced path-planning techniques. Additionally, exploring alternative hardware configurations and sensors could further improve the system’s performance and adaptability.

C. Final Thoughts

Overall, this project demonstrates the potential of combining hardware-native object detection with real-time depth sensing and odometry correction to enable autonomous navigation and manipulation tasks in robotics. The successful integration of these components highlights the importance of modular design and hypothesis-driven testing in developing robust robotic systems.