
Embodied Robot Task Planning Training Platform RAI-P4
An all-in-one training platform for embodied interactive agents, integrating foundation models, voice, vision, robotic arms, and gimbals. Compatible with mainstream models such as DeepSeek, Qwen, and Doubao, and supporting OpenCV, YOLO, and multimodal VLM workflows. Practical course modules take agent design from concept to deployment, covering robot control, agent development, and vision integration for university-level training.
Applicable Audience/Scenarios
Ideal for university programs in artificial intelligence, robotics, automation, and computer science that span LLM applications, computer vision, machine learning, deep learning, embedded development, sensing and control, ROS, robotics, simulation, and intelligent system integration.
Highlights
- Unified platform for AI speech, AI vision, and manipulator control
- Desktop-friendly footprint (60 cm × 60 cm) for rapid deployment
- Progressive path supporting 4-DOF through 6-DOF manipulators
Product Features
Integrated AI and robotics stack
Built around the requirements of an intelligent manipulator, the platform combines AI speech interaction, AI vision recognition, an AI edge board, and sensors such as color-recognition and IMU modules, supporting the full chain from perception to decision to execution.
Ready-to-run teaching deployment
Hardware and software are pre-aligned at the factory. No extra PCs or tooling are needed; a 60 cm × 60 cm desktop is enough to launch experiments in labs, innovation studios, or mobile workshops.
Progressive manipulator training
Starts with a 4-DOF serial manipulator and scales to typical 6-DOF configurations. Complementary exercises cover kinematics, motion control, simulation, and ROS so students can advance step by step.
Lab Scenarios
Configuration
Sensor Configuration
Provides multimodal inputs required for embodied task planning, covering speech, vision, and motion feedback.
- AI speech interaction microphone array
- Vision pan-tilt camera module
- Posture sensing IMU
- Manipulator camera module
Controller Configuration
An AI edge board with open I/O delivers both LLM/vision inference and manipulator/peripheral control, ensuring tight hardware–software integration.

Software Configuration
Ships with Ubuntu and ROS 2 (including RViz and MoveIt), plus Jupyter, VS Code, and Python 3.9 so classes can start deploying algorithms immediately.
Compatible with mainstream AI/robotics ecosystems such as OpenCV, YOLO, LLM SDKs, and MoveIt, supporting both teaching and research.

Experiments
More than 40 sub-projects span manipulator control, sensing, computer vision, LLM voice dialogue, system integration, ROS, and embedded development, enabling cross-disciplinary skill building.
Manipulator control fundamentals
- Manipulator kinematics control: 4 class hours | Build forward/inverse kinematics and joint trajectory planning for the 4-DOF arm (see the kinematics sketch after this list).
- Linear interpolation control: 2 class hours | Execute end-effector linear trajectories and manage velocity/acceleration profiles.
- Circular interpolation control: 2 class hours | Generate spatial arc trajectories while maintaining attitude control.
- Stacking and handling tasks: 4 class hours | Combine coordinate calibration and grasp strategies to plan multi-point handling.
- Drawing geometric patterns: 4 class hours | Produce planar geometric figures through custom trajectory generation.
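As a taste of what the kinematics module covers, here is a minimal planar two-link forward/inverse kinematics sketch in Python. It is illustrative only: the link lengths are hypothetical placeholders, and the actual labs would use the platform's own 4-DOF arm model and parameters.

```python
# Minimal planar 2-link kinematics sketch (illustrative only; the RAI-P4
# labs use the platform's own 4-DOF arm model and link parameters).
import math

L1, L2 = 0.12, 0.10  # hypothetical link lengths in meters

def forward_kinematics(theta1, theta2):
    """End-effector (x, y) for joint angles in radians."""
    x = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    y = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    return x, y

def inverse_kinematics(x, y, elbow_up=True):
    """One analytic IK solution for a reachable (x, y) target."""
    c2 = (x**2 + y**2 - L1**2 - L2**2) / (2 * L1 * L2)
    c2 = max(-1.0, min(1.0, c2))            # clamp numerical noise
    theta2 = math.acos(c2) * (1 if elbow_up else -1)
    k1, k2 = L1 + L2 * math.cos(theta2), L2 * math.sin(theta2)
    theta1 = math.atan2(y, x) - math.atan2(k2, k1)
    return theta1, theta2

x, y = forward_kinematics(0.5, -0.3)
print(inverse_kinematics(x, y, elbow_up=False))  # recovers (0.5, -0.3)
```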
Sensor acquisition & control
- IMU data acquisition: 2 class hours | Read posture sensor data, complete orientation estimation, and apply filtering (see the filter sketch after this list).
- Gesture-controlled manipulator: 2 class hours | Drive the manipulator via posture sensor input for embodied interaction.
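For the orientation-estimation step, a complementary filter is a common starting point. The sketch below is a minimal Python version; read_imu() is a hypothetical stand-in for the platform's IMU driver, whose axis conventions and units may differ.

```python
# Complementary-filter sketch for pitch estimation from raw IMU samples.
import math, time

ALPHA = 0.98  # trust gyro integration short-term, accelerometer long-term

def read_imu():
    """Stand-in for the platform IMU driver; returns (ax, ay, az, gx, gy, gz)."""
    return 0.0, 0.0, 9.81, 0.0, 0.0, 0.0  # level, stationary dummy sample

pitch, last = 0.0, time.time()
for _ in range(100):
    ax, ay, az, gx, gy, gz = read_imu()
    now = time.time(); dt = now - last; last = now
    accel_pitch = math.atan2(-ax, math.hypot(ay, az))  # gravity-based estimate
    pitch = ALPHA * (pitch + gy * dt) + (1 - ALPHA) * accel_pitch
print(f"pitch estimate: {math.degrees(pitch):.1f} deg")
```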
Computer vision basics (OpenCV)
- Color recognition: 2 class hours | Convert color spaces and segment targets with OpenCV (see the sketch after this list).
- Shape recognition: 2 class hours | Extract contours and match geometric features to classify shapes.
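A minimal OpenCV color-segmentation sketch of the kind this lab builds on. The HSV range shown targets red-ish objects and is only a starting point; it would need tuning for the platform's camera and lighting.

```python
# HSV color segmentation with OpenCV: threshold a color range, then find
# and box the resulting blobs.
import cv2
import numpy as np

frame = cv2.imread("scene.jpg")                      # or a live camera frame
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, np.array([0, 120, 70]), np.array([10, 255, 255]))
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    if cv2.contourArea(c) > 500:                     # ignore small noise blobs
        x, y, w, h = cv2.boundingRect(c)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.jpg", frame)
```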
AI vision (YOLO)
- YOLO deployment: 2 class hours | Deploy YOLO on the embedded board for real-time inference (see the sketch after this list).
- Face detection: 2 class hours | Load pretrained weights to detect faces and output bounding boxes.
- Face tracking: 2 class hours | Combine the pan-tilt unit and vision feedback for dynamic face tracking.
- Dataset annotation: 2 class hours | Annotate detection datasets and handle format conversions.
- Model training & deployment: 2 class hours | Fine-tune, quantize, and deploy YOLO models.
- Workpiece inspection: 2 class hours | Build application-specific detection to classify and localize workpieces.
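One common way to run real-time YOLO inference in Python is the Ultralytics package, sketched below. The course image may ship a different YOLO toolchain, so treat the weights file, camera index, and package choice as assumptions.

```python
# Real-time YOLO inference sketch using the Ultralytics package.
from ultralytics import YOLO
import cv2

model = YOLO("yolov8n.pt")            # pretrained weights; swap in your own
cap = cv2.VideoCapture(0)             # camera index is platform-dependent
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, verbose=False)[0]   # one Results object per frame
    annotated = results.plot()                 # draw boxes and class labels
    cv2.imshow("yolo", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```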
AI vision (Tongyi Qianwen multimodal)
- Qwen multimodal API deployment: 2 class hours | Call the Tongyi Qianwen API for image understanding and text generation (see the sketch after this list).
- Fruit detection & annotation: 2 class hours | Use Tongyi Qianwen to recognize fruit targets and generate semantic labels.
LLM applications (AI voice dialogue)
- ASR deployment: 2 class hours | Configure the Tongyi Qianwen ASR service to parse voice input.
- LLM semantic planner: 2 class hours | Deploy DeepSeek to handle intent understanding and task planning.
- TTS deployment: 2 class hours | Integrate Volcano Engine TTS for natural voice responses.
- End-to-end voice dialogue: 2 class hours | Chain ASR, LLM, and TTS to build a full conversational loop.
- Function-call voice calculator: 2 class hours | Implement voice-driven calculations via LLM function calls (see the sketch after this list).
- Function-call music playback: 2 class hours | Control music retrieval and playback through voice commands.
- Function-call pan-tilt task planner: 4 class hours | Use voice instructions to drive pan-tilt tracking and target search.
- Function-call manipulator task planner: 4 class hours | Trigger vision-based positioning and grasping through voice commands.
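The function-call labs hinge on the LLM emitting structured tool calls instead of free text. Below is a minimal calculator sketch against DeepSeek's OpenAI-compatible chat API; the endpoint and model name are assumptions, and the ASR/TTS stages of the full voice loop are omitted.

```python
# LLM function-calling sketch: the model decides to call a calculator tool.
import json
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",
                base_url="https://api.deepseek.com")

tools = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

def calculate(expression: str) -> str:
    # Restricted eval for arithmetic only; a real lab should use a parser.
    return str(eval(expression, {"__builtins__": {}}, {}))

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "What is 12 times 34 plus 5?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]  # assumes the model called the tool
args = json.loads(call.function.arguments)
print(calculate(args["expression"]))          # model supplies e.g. "12*34+5"
```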
Robotic system integration
- Socket communication: 2 class hours | Build a socket channel and exchange commands between subsystems (see the sketch after this list).
- Vision-driven manipulator tracking: 4 class hours | Map vision data to the manipulator coordinate frame for dynamic tracking.
- Vision–manipulator hand–eye calibration: 2 class hours | Complete hand–eye calibration to map pixels to poses.
- Vision-based sorting: 4 class hours | Combine perception, planning, and execution to complete sorting tasks.
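A minimal JSON-over-TCP command channel of the kind the socket lab builds, using only the Python standard library. The {"cmd": ..., "args": ...} schema is illustrative, not the platform's actual protocol.

```python
# One-shot JSON command exchange between two subsystems over TCP.
import json, socket, threading, time

def serve(host="127.0.0.1", port=9090):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port)); srv.listen(1)
    conn, _ = srv.accept()
    with conn:
        msg = json.loads(conn.recv(4096).decode())   # e.g. a grasp command
        conn.sendall(json.dumps({"ok": True, "echo": msg}).encode())
    srv.close()

threading.Thread(target=serve, daemon=True).start()
time.sleep(0.2)                                      # let the server listen

cli = socket.create_connection(("127.0.0.1", 9090))
cli.sendall(json.dumps({"cmd": "grasp", "args": {"x": 0.1, "y": 0.2}}).encode())
print(json.loads(cli.recv(4096).decode()))
cli.close()
```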
ROS (Robot Operating System)
- Run a ROS2 project quickly: 2 class hours | Create, build, and run ROS2 workspaces (see the node sketch after this list).
- Build and port ROS2 packages: 2 class hours | Create packages, manage dependencies, and port functionality.
- MoveIt configuration: 2 class hours | Configure MoveIt scenes, import collision models, and validate planning.
- 4-DOF MoveIt/RViz simulation: 2 class hours | Control the 4-DOF arm in RViz and verify trajectories.
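A minimal rclpy publisher node, the kind of starting point the first ROS2 lab uses. The /arm_cmd topic name and the command string are illustrative, not the platform's real interface.

```python
# Minimal ROS 2 (rclpy) node that publishes a command string once per second.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class ArmCmdPublisher(Node):
    def __init__(self):
        super().__init__("arm_cmd_publisher")
        self.pub = self.create_publisher(String, "/arm_cmd", 10)
        self.timer = self.create_timer(1.0, self.tick)

    def tick(self):
        msg = String()
        msg.data = "home"                 # placeholder command
        self.pub.publish(msg)
        self.get_logger().info(f"published: {msg.data}")

def main():
    rclpy.init()
    node = ArmCmdPublisher()
    try:
        rclpy.spin(node)
    finally:
        node.destroy_node()
        rclpy.shutdown()

if __name__ == "__main__":
    main()
```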
Embedded system development
- Ubuntu filesystem essentials: 1 class hour | Learn common directory structures and file commands.
- Editor familiarization (vi / nano): 1 class hour | Practice terminal editor basics and configuration.
- Remote access setup (SSH / PuTTY): 2 class hours | Configure remote connections for collaborative development.
- Linux file I/O programming: 2 class hours | Implement file read/write with proper exception handling.
- Serial communication: 2 class hours | Exchange serial data and design simple protocols (see the sketch after this list).
- Process / thread management: 2 class hours | Understand Linux processes and threads and write sample programs.
- Interface design: 2 class hours | Quickly build human–machine interfaces with Python/Qt.
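For the serial-communication lab, a pyserial sketch like the one below covers the basic request/reply pattern. The device path, baud rate, and PING message are hypothetical; check the board's actual serial device and protocol.

```python
# Line-based serial request/reply with pyserial.
import serial

with serial.Serial("/dev/ttyUSB0", 115200, timeout=1) as port:
    port.write(b"PING\n")                 # toy request in a line-based protocol
    reply = port.readline().decode(errors="replace").strip()
    print(f"device replied: {reply!r}")
```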
Knowledge Base
Get more technical documentation, tutorials, and FAQs about this product.
Key Questions
Q3: Which products support LLM integration, and what can they do?
Related products: Embodied Composite Robot System Design Training Platform RAI-M4, Embodied Robot Task Planning Training Platform RAI-P4, Embodied Vision Perception & Decision Training Platform RAI-Q2
Answer: There are three products that support large-model integration:
RAI-P4: integrates Qwen, DeepSeek, and Volcano Engine; supports ASR (Qwen), LLM (DeepSeek), TTS (Volcano Engine), and function calling (such as voice-dialog calculators, music playback, and gimbal / robotic arm task planning), and also supports integrated applications with YOLO, face tracking, and robotic arm control.
RAI-M4: connects to DeepSeek (LLM) and Qwen (ASR + multimodal); supports converting natural language into robot task workflows (voice commands for chassis / robotic arm control) and multimodal object detection (Qwen), combining a mecanum chassis and a 4-axis robotic arm to achieve generalized manipulation.

