memory-bandwidth

C++23 benchmarking framework with 6 profiler backends, CUDA GPU support, statistical regression detection, cross-compilation for 5 architectures, and CLI tools for analysis and visualization.

Updated May 2, 2026
C++

parallelArchitect / pascal-um-benchmark

Reproducible Pascal GPU Unified Memory benchmark with Nsight and nvprof profiling

benchmark-suite memory-bandwidth page-faults nvprof unified-memory nsight-systems cuda-pascal-unified-memory-gpu nsight-nvprof-pcie-bandwidth cudamallocmanaged cudamemprefetchasync cudapascal

Updated Feb 1, 2026
Python

srvr-farm / memwatch

Console UI for watching memory stats

linux rust terminal tui perf performance-monitoring pmu memory-bandwidth memory-monitor dmidecode hardware-monitor ram-monitor ratatui

Updated May 7, 2026
Rust

PV-J / hetero-memory-lab

Python lab for exploring memory bandwidth, cache effects, and locality in accelerator workloads

performance-engineering hpc cuda memory-bandwidth gpu-performance roofline-model cache-locality tiling-optimization

Updated Dec 24, 2025
Python

ukri-bench / benchmark-s-babelstream

Benchmark for memory bandwidth

benchmark synthetic memory-bandwidth

Updated Jun 30, 2025
Python

ahmadrezarazian / OpenCL_MultiDevice_Bandwidth_Analyzer

OpenCL benchmarking tool to measure host-device bandwidth and kernel global memory throughput across GPUs and CPUs.

opencl parallel-computing memory-bandwidth gpu-benchmark gpu-validation compute-benchmark

Updated Mar 16, 2026
C

varad-more / fused-triton-rmsnorm-residual-qkv

Production-grade Triton kernel fusing residual add + RMSNorm + packed QKV projection into a single GPU launch for decoder-only transformer inference (Llama-3, Mistral, Qwen2). +2.4% tok/s, -1.5 GB VRAM on A10G.

cuda pytorch transformer triton llama memory-bandwidth gpu-kernels kernel-fusion rmsnorm llm-inference

Updated Apr 22, 2026
Python

jman4162 / Sizing-AI-Training-by-Cost-per-Memory-Bandwidth

A practical model (with math + Python) to tell if you’re compute-, memory-, or network-bound—and what to buy next

aws distributed-systems machine-learning ai ml pytorch transformer aws-ec2 hbm systems-performance memory-bandwidth cost-optimization distributed-training nccl roofline-model llm llm-training ai-infrastructure

Updated Sep 4, 2025
Jupyter Notebook

UnrealJon / DTDR

Transform-domain representation enabling 3–4× storage reduction with direct ANN search and novel multi-resolution signals. UK patent application under accelerated examination (Green Channel).

machine-learning embeddings similarity-search memory-bandwidth vector-database ann-search approximate-nearest-neighbor transform-domain

Updated May 5, 2026
Jupyter Notebook

Vinayk393 / CECS530-TokenGenerationLatency

Bandwidth-focused LLaMA token-generation latency benchmarking on Apple Silicon M4 vs M2 with TTFT, PTL, KV-cache, quantization projections, and reproducible architecture analysis.

benchmarking pytorch llama mps computer-architecture quantization memory-bandwidth kv-cache apple-silicon llm-inference

Updated May 10, 2026
Python

VSJ001 / Cache-Aware-and-GPU-Accelerated-Sparse-Matrix-Vector-Multiplication

CUDA SpMV kernels (scalar, warp-per-row, ELL) on NVIDIA A100 benchmarked against cuSPARSE on SuiteSparse matrices, plus AVX2 + cache-tiled CPU baselines on Intel Xeon Gold. Vector kernel reaches 98-110% of HBM2 peak, beating cuSPARSE by 24-56% on regular matrices.

c performance-engineering cpp hpc gpu parallel-computing cuda nvidia simd high-performance-computing avx2 cuda-kernels spmv sparse-matrix memory-bandwidth cusparse cache-optimization

Updated May 8, 2026
Cuda

Here are 25 public repositories matching this topic...

UoB-HPC / BabelStream

intel / memory-bandwidth-benchmarks

ZigZag-Project / zigzag-v1

SpareCores / sc-membench

ISTEQ-BV / OpenFOAM_Workshop_2023_Demo

hkatsura / AmorphousMemoryMark

caps-tum / mmbwmon

ISTEQ-BV / ET

hkimw / llm-bottleneck-lab

apexedgesystems / vernier

parallelArchitect / pascal-um-benchmark

srvr-farm / memwatch

PV-J / hetero-memory-lab

ukri-bench / benchmark-s-babelstream

ahmadrezarazian / OpenCL_MultiDevice_Bandwidth_Analyzer

varad-more / fused-triton-rmsnorm-residual-qkv

jman4162 / Sizing-AI-Training-by-Cost-per-Memory-Bandwidth

UnrealJon / DTDR

Vinayk393 / CECS530-TokenGenerationLatency

VSJ001 / Cache-Aware-and-GPU-Accelerated-Sparse-Matrix-Vector-Multiplication
