TL;DR, after lots of research: don't use consumer SSDs. Only use enterprise SSDs.
Attempt / Experiment Number 2.
I ended up ordering 5x 1T Samsung PM863a enterprise SATA drives.
After reinstalling Ceph, I put three of the drives into kube05, and one more into kube01 (no ports / power for adding more than a single SATA disk...).
Then I put the cluster together. At first, performance wasn't great (but it was still 10x the performance of the first attempt!). After updating the CRUSH map to set the failure domain to OSD rather than host, though, performance picked up quite dramatically.
This is due to the current imbalance of storage per host. Kube05 has 3T of drives, Kube01 has 1T. No storage elsewhere.
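To see why the failure domain matters with this layout, here is a minimal Python sketch. It is a toy model, not Ceph's actual CRUSH algorithm: it only checks whether a given number of replicas can each land in a distinct failure domain. The host names and OSD counts come from the layout above; the function name `can_place` and the replica count of 3 are assumptions for illustration.

```python
# Toy model only: real CRUSH is a weighted pseudo-random placement algorithm.
# This just counts how many distinct failure domains are available.

# OSDs per host, taken from the layout described above.
cluster = {"kube05": 3, "kube01": 1}


def can_place(cluster, replicas, failure_domain):
    """Return True if `replicas` copies can each sit in a distinct failure domain."""
    if failure_domain == "host":
        # Each host with at least one OSD counts as one domain.
        domains = sum(1 for osds in cluster.values() if osds > 0)
    elif failure_domain == "osd":
        # Every individual OSD counts as its own domain.
        domains = sum(cluster.values())
    else:
        raise ValueError(f"unknown failure domain: {failure_domain}")
    return domains >= replicas


print(can_place(cluster, 3, "host"))  # False: only 2 hosts have OSDs
print(can_place(cluster, 3, "osd"))   # True: 4 OSDs in total
```

The trade-off is that with an OSD-level failure domain, more than one copy of an object can end up on the same host, so losing that host can take out several copies at once.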
But since this was a very successful test, and it delivered enough IOPs to run my I/O-heavy Kubernetes workloads, I decided to take it up another step.
A few notes:
Can you guess which drive is the Samsung 980 EVO, and which drives are enterprise SATA SSDs? (Look at the latency column.)
Future - Attempt #3
The next goal is to properly distribute OSDs.
Since I am maxed out on the number of 2.5" SATA drives I can deploy, I picked up some NVMe:
5x 1T Samsung PM963 M.2 NVMe.
I picked up a pair of dual-slot half-height bifurcation cards for Kube02. This will allow me to place 4 of these into it, with dedicated bandwidth to the CPU.
The remaining one will be placed inside Kube01, to replace the 1T Samsung 980 NVMe.
This should give me a pretty decent distribution of data, and with all enterprise drives, it should deliver pretty acceptable performance.
Ceph works best if you have identical OSD quantity, type and capacity across the cluster. It also works best on a 3+ node cluster.
I ran a mixed SATA SSD/HDD 256GB/4TB cluster and it was always a bit pants. Now I have 7x 1TB SSD per node (4 nodes) and it works fantastic.
Proxmox uses replica 3/2 with failure at the host level, but you may find that EC works better for your mixed infra. As you noticed, you can't meet the 3-host failure domain, so setting it to the OSD failure level means data may be kept on a single host and would need to traverse the network to the other machine.
You may also need more than a single 10Gb nic too as you might start hitting bandwidth issues.
I ended up having to set the failure domain to OSD rather than host, at least until the next group of 5 enterprise SSDs arrives to properly distribute data across all three nodes. Once they arrive, I'll be able to set up a fairly even distribution of data across all three 10G nodes.
You may also need more than a single 10Gb nic too as you might start hitting bandwidth issues.
Knock on wood, I don't "think" I have enough heavy bandwidth loads for this to be a huge issue, at least, with the exception of when the backups are running. Most of my workloads use fast random I/O. (databases, kubernetes, etc.)
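A rough back-of-the-envelope check of that, as a Python sketch. The 4 KiB block size is an assumption (a common size for small random I/O), the 50,000 IOPS figure is just a round number for illustration, and the calculation ignores protocol overhead and the extra replication traffic between OSDs.

```python
def iops_to_gbit_per_s(iops, block_size_bytes=4096):
    """Convert an IOPS figure at a given block size into gigabits per second."""
    return iops * block_size_bytes * 8 / 1e9


link_gbit = 10  # a single 10Gb NIC
load = iops_to_gbit_per_s(50_000)  # assumed workload: 50k IOPS at 4 KiB

print(f"{load:.2f} Gbit/s of a {link_gbit} Gbit/s link")
# prints: 1.64 Gbit/s of a 10 Gbit/s link
```

Small random I/O at that rate sits well under the line rate of one 10Gb link. Large sequential transfers such as backups are a different story, since they can fill the link on their own.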
BUT.... I do have 40G networking on the r730xd already, and I have enough 40G NICs lying around to build a full-mesh 40G network between those three nodes if needed.
So my production setup is 2x10Gb bonded NICs for networking and 2x10Gb bonded NICs for Ceph/Cluster stuff. I suspect that when Ceph is being heavily used you may see bottlenecks; however, once you have host-based failure, then in theory your data should be closer to the correct host and not have an issue. But at a basic level it's like having 3 copies of data, one on each host, so it doesn't save you any storage, it just reduces the risks during failure.
Thinking about it, you may actually see better results with ZFS and replicate jobs, as there are fewer overheads and the ZFS sending is incremental. You'd obviously just lose X minutes of data instead of X seconds with Ceph.
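To put rough numbers on the storage cost of 3 copies versus erasure coding, here is a small Python sketch. The 12 TB raw capacity and the k=2, m=1 erasure-coding profile are hypothetical values picked for illustration, and the arithmetic ignores Ceph's own overhead and the free headroom you'd want to keep.

```python
def usable_replicated(raw_tb, size=3):
    """Usable capacity when every object is stored `size` times."""
    return raw_tb / size


def usable_erasure_coded(raw_tb, k, m):
    """Usable capacity with k data chunks and m coding chunks per object."""
    return raw_tb * k / (k + m)


raw_tb = 12  # hypothetical raw capacity across the whole cluster

print(usable_replicated(raw_tb, size=3))       # 4.0
print(usable_erasure_coded(raw_tb, k=2, m=1))  # 8.0
```

With these assumed numbers, replica 3 leaves a third of the raw space usable, while the 2+1 profile leaves two thirds, at the price of tolerating only one lost chunk instead of two lost copies.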
you may actually see better results with ZFS and replicate jobs
Oh, I know the performance is drastically better doing that. I did play with it, and it works for the most part. Performance is dramatically better, but with Ceph I have peace of mind knowing that if a host just magically craps itself, the data is already ready to go and the machine has already fired up on the new host without any issues.
Also, there is something fun about literally tossing over 6 million IOPs worth of SSDs into my cluster, just to barely squeeze 50k IOPs out of ceph!
I have 5 more "enterprise" NVMes arriving Tuesday, which will complete my Ceph cluster.
Currently, I have 4 of the enterprise SATA SSDs in place, and a single 980 as a placeholder.
Nothing at all to write home about. BUT, I do think the lack of distributed drives is making an impact. My most powerful host doesn't have any OSDs yet; I'm still waiting on the NVMe to arrive.
During heavy benchmarking, the limitations of the consumer 980 EVO became pretty apparent when its latency spiked through the roof.
The addition of the 5 new NVMe drives should make a pretty dramatic difference. If I can squeeze out 100k IOPs, I will be happy. (Despite.... over 6 million IOPs worth of SSDs...)