This is a complete, production-ready AI-powered system for migrating CUDA codebases to Intel SYCL/DPC++ using Agentic AI coding tool like Claude Code. The system automates the entire migration workflow from initial scanning through performance optimization.
┌─────────────────────────────────────────────────────────────────┐
│ CLAUDE.md │
│ (Main Orchestration Agent) │
│ │
│ Manages Multi-Phase Migration Workflow: │
│ 1. Repository Analysis │
│ 2. CUDA Build Setup │
│ 3. CUDA Testing │
│ 4. SYCL Translation │
│ 5. SYCL Testing │
│ 6. Remote Execution & Validation │
│ 7. Performance Optimization │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Subagents │ │ Skills │ │ Tools │
├──────────────┤ ├──────────────┤ ├──────────────┤
│ @cuda-scanner│ │scan-cuda-repo│ │ SSH/SFTP │
│ @cmake- │ │analyze- │ │ CMake │
│ builder │ │ complexity │ │ Google Test │
│ @cuda-test- │ │generate-cmake│ │ CUDA Toolkit │
│ generator │ │create-cuda- │ │ Intel oneAPI │
│ @sycl- │ │ tests │ │ VTune │
│ translator │ │ │ │ Advisor │
│ @sycl-test- │ │ │ │ Nsight │
│ generator │ │execute-remote│ │ │
│ @remote- │ │ -ssh │ │ │
│ executor │ │compare- │ │ │
│ @performance-│ │ numerical- │ │ │
│ optimizer │ │ results │ │ │
│ │ │profile-gpu- │ │ │
│ │ │ kernel │ │ │
│ │ │optimize-sycl-│ │ │
│ │ │ kernel │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
- CLAUDE.md - Main agent orchestration
- Multi-phase workflow management
- Subagent coordination
- Quality assurance checks
- User interaction flow
-
subagents/cuda-scanner.md
- Repository scanning
- Kernel inventory creation
- Complexity assessment
- Dependency analysis
-
subagents/cmake-builder.md
- CMakeLists.txt generation
- CUDA/SYCL build configuration
- Test framework integration
-
subagents/cuda-test-generator.md
- Google Test suite creation for CUDA
- Input/output serialization
- Benchmark integration
-
subagents/sycl-translator.md
- CUDA to SYCL translation
- Intel GPU optimizations
- Semantic preservation
-
subagents/sycl-test-generator.md
- Google Test suite creation for SYCL
- Mirror CUDA test structure
- Result comparison setup
-
subagents/remote-executor.md
- SSH connection management
- Remote build and execution
- Result collection
- Iterative validation
-
subagents/performance-optimizer.md
- Theoretical analysis
- Roofline modeling
- Bottleneck identification
- Optimization application
-
scan-cuda-repo
- Recursive CUDA file discovery
- Kernel function extraction
- Feature detection
- External library identification
-
analyze-kernel-complexity
- Operation counting
- Memory analysis
- Arithmetic intensity calculation
- Roofline positioning
-
generate-cmake
- Template-based CMake generation
- CUDA/SYCL configuration
- Cross-platform support
- Dependency management
-
create-cuda-tests
- Test fixture generation
- Resource management
- Benchmark suite creation
- Serialization utilities
-
execute-remote-ssh
- Paramiko-based SSH
- File transfer (SFTP)
- Remote command execution
- Workflow automation
-
compare-numerical-results
- NumPy-based comparison
- Tolerance checking
- Mismatch pattern analysis
- Diagnostic reporting
-
profile-gpu-kernel
- VTune/Advisor integration
- Nsight integration
- Metrics extraction
- Roofline generation
-
optimize-sycl-kernel
- Code transformation
- Intel attribute insertion
- Performance validation
- Optimization selection
- README.md
- Complete system documentation
- Installation instructions
- Usage examples
- Troubleshooting guide
-
Install Claude Code:
# Follow instructions at code.claude.com npm install -g @anthropic-ai/claude-code -
Set up the migration system:
# Create project structure mkdir cuda-to-sycl-migration cd cuda-to-sycl-migration # Copy all files cp /path/to/CLAUDE.md . mkdir -p subagents skills cp /path/to/subagents/*.md subagents/ cp /path/to/skills/*.{py,md} skills/ # Make skills executable chmod +x skills/*.py
-
Install Python dependencies:
pip install numpy paramiko matplotlib
-
Configure GPU servers:
# Set up SSH keys ssh-keygen -t rsa -b 4096 ssh-copy-id user@nvidia-gpu-server ssh-copy-id user@intel-gpu-server
# Navigate to CUDA repository
cd /path/to/cuda/project
# Start Claude Code
claude-code
# In Claude Code interface:User: I want to migrate this CUDA codebase to SYCL for Intel Data Center GPU Max 1550.
Claude: I'll help you migrate this CUDA codebase to SYCL. Let me start by scanning your repository to identify all CUDA kernels.
[Claude invokes @cuda_scanner...]
Claude: I found 15 CUDA files with 42 kernels. Here's the breakdown:
- Simple kernels: 25 (straightforward migration)
- Moderate kernels: 15 (require careful translation)
- Complex kernels: 2 (need manual review)
To proceed, I need:
- NVIDIA GPU server SSH details
- Intel GPU server SSH details
- Performance target (default: 80% of theoretical peak)
- Numerical tolerance (default: rtol=1e-5, atol=1e-8)
User:
- NVIDIA: gpu1.example.com, user: cuda_user, key: ~/.ssh/id_rsa
- Intel: gpu2.example.com, user: sycl_user, key: ~/.ssh/id_rsa
- Target: 85%
- Tolerance: default is fine
Claude: Perfect! I'll begin the migration process...
[System proceeds through all phases automatically]
- Complete end-to-end automation
- Minimal manual intervention required
- Intelligent error recovery
- Numerical comparison with configurable tolerance
- Mismatch pattern analysis
- Iterative debugging (max 10 iterations)
- CPU reference validation
- Theoretical analysis (roofline model)
- Profiler integration (VTune, Advisor, Nsight)
- Intel-specific optimizations
- Automated optimization application
- Google Test framework
- Input/output serialization
- Benchmark metrics collection
- Parameterized tests
- Error handling throughout
- Detailed logging
- Progress reporting
- Clean directory organization
| Phase | Duration | Description |
|---|---|---|
| 1. Scanning | 1-5 min | Repository analysis |
| 2. CUDA Build | 5-15 min | Build system setup |
| 3. CUDA Testing | 10-30 min | Test generation and execution |
| 4. SYCL Translation | 10-60 min | Kernel translation |
| 5. SYCL Testing | 10-30 min | Test generation |
| 6. Validation | 20-120 min | Remote execution and comparison |
| 7. Optimization | 30-180 min | Performance tuning |
| Total | 1.5-7 hours | Complete migration |
| Kernel Type | Expected SYCL Performance |
|---|---|
| Memory-bound | 70-85% of bandwidth |
| Compute-bound | 75-90% of peak FLOPS |
| Well-optimized GEMM | 80-95% with matrix engines |
# Migrate multiple kernels in parallel
for kernel in kernel_list:
migrate_kernel(kernel, parallel=True)# Override default optimization selection
optimizations = {
'matrixMul': ['tiling', 'subgroup_ops', 'vectorization'],
'vectorAdd': ['vectorization', 'coalescing']
}# Configure for different Intel GPU platforms
platform_configs = {
'pvc': { # Ponte Vecchio (Data Center GPU Max)
'work_group_size': 256,
'subgroup_size': 16,
'tile_size': 32
},
'acm-g10': { # Arc GPU
'work_group_size': 128,
'subgroup_size': 16,
'tile_size': 16
}
}1. SSH Connection Failures
# Test connection manually
ssh -i ~/.ssh/id_rsa user@gpu-server
# Check firewall
telnet gpu-server 22
# Verify key permissions
chmod 600 ~/.ssh/id_rsa2. Numerical Mismatches
- Check thread indexing translation
- Verify synchronization points
- Increase tolerance if appropriate
- Review floating-point precision
3. Performance Below Target
- Run profiler to identify bottleneck
- Check memory access patterns
- Verify work-group size is optimal
- Consider algorithm changes
4. Build Failures
- Verify compiler versions
- Check CMake version (≥3.18 for CUDA, ≥3.20 for SYCL)
- Review error messages for missing dependencies
# my_custom_agent.md
You are a specialized agent for [specific task].
## Responsibilities
- ...
## Skills to Use
- ...
## Output Format
- ...#!/usr/bin/env python3
"""
Skill: my-custom-skill
Description of what this skill does.
"""
def my_skill_function(args):
# Implementation
return results
if __name__ == "__main__":
# CLI interface
pass# Test individual skills
python skills/scan_cuda_repo.py ./test_data
python skills/compare_numerical_results.py cuda_output.bin sycl_output.bin# Test end-to-end on sample project
claude-code test-migration ./sample_cuda_project- Translation accuracy: ~95% semantic correctness
- Performance target achievement: 80-90% of kernels meet target
- Time savings: 10-20x faster than manual migration
- Error rate: <5% requiring manual intervention
- Start with simple kernels to validate the workflow
- Run validation frequently during migration
- Profile early to understand performance baseline
- Document manual changes for complex kernels
- Version control everything including generated files
- Test on representative data sizes
- Validate on multiple inputs to ensure robustness
- Claude Code: https://code.claude.com/docs
- SYCL Specification: https://www.khronos.org/sycl/
- Intel oneAPI: https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html
- Intel VTune: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
- Intel Advisor: https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html
- NVIDIA Nsight: https://developer.nvidia.com/nsight-compute
This migration system is provided as-is for use with Claude Code.
Ready to migrate?
- Install Claude Code
- Copy all system files to your project
- Run
claude-codein your CUDA repository - Say: "Migrate this CUDA code to SYCL"
The system handles the rest automatically!