I first started this hobby almost a year ago. Llama 3 8B had been released a day or so prior. I had finally caught on and loaded up a llamafile on my old ThinkPad.

It only ran at 0.7-1 t/s. But it ran. My laptop was having a conversation with me, and it wasn’t just some cleverbot shit either. I was hooked, man! It inspired me to dig out the old gaming rig collecting cobwebs in the basement and get to know its specs better. Machine learning and neural networks are fascinating.

From there I rode the train of higher and higher parameter counts, newer and better models. My poor old NVIDIA GTX 1070 8GB has its limits, though, as do I.

I love Mistral models. Mistral Small 24B at Q4_K_M was the perfect upper limit of quality vs. speed at 2.7-3 t/s. But for DeepHermes in CoT mode, spending thousands of tokens on thinking at that speed was very time consuming.
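
For anyone wondering why that felt slow on an 8GB card, here’s my rough back-of-envelope math. The bits-per-weight figures are approximations commonly quoted for llama.cpp quants, not exact numbers, and I’m ignoring KV cache and runtime overhead:

```python
# Rough back-of-envelope: why a 24B Q4_K_M model spills out of 8 GB of VRAM.
# Assumed averages (approximate, not exact): Q4_K_M ~4.85 bits/weight,
# Q6_K ~6.56 bits/weight. KV cache and overhead are ignored.

def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"24B @ Q4_K_M: ~{model_size_gb(24, 4.85):.1f} GB")  # ~14.6 GB -> partial offload on 8 GB
print(f" 8B @ Q6_K:   ~{model_size_gb(8, 6.56):.1f} GB")   # ~6.6 GB  -> fits fully on 8 GB
```

So roughly half the 24B model was living in system RAM and crawling through the CPU, which is why every long CoT chain took forever.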

Well, I had neglected to try DeepHermes 8B, based off my first model, Llama 3. Until now. I can fit the highest Q6 quant completely on my card. I’ve never loaded a model fully in VRAM before; I’d always been partially offloading.
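
If you’ve also been partially offloading forever, the switch is a single knob. A minimal sketch using llama-cpp-python (not necessarily the frontend you use; the filename and layer count here are placeholders):

```python
# Minimal sketch with llama-cpp-python; the model path is a placeholder.
# n_gpu_layers controls offloading: a partial count splits layers between
# CPU and GPU (slow), while -1 pushes every layer onto the card (fast,
# as long as the whole quant actually fits in VRAM).
from llama_cpp import Llama

# Partial offload: what I'd been doing with the big quants on the 1070.
# llm = Llama(model_path="DeepHermes-3-Llama-3-8B.Q6_K.gguf", n_gpu_layers=20)

# Full offload: the whole Q6 quant lives in the 8 GB of VRAM.
llm = Llama(
    model_path="DeepHermes-3-Llama-3-8B.Q6_K.gguf",  # placeholder filename
    n_gpu_layers=-1,  # -1 = offload all layers to the GPU
    n_ctx=4096,       # context window; raising it grows VRAM use too
)

out = llm("Explain why full GPU offload is so much faster.", max_tokens=128)
print(out["choices"][0]["text"])
```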

What a night and day difference it makes! Entire paragraphs in seconds instead of a sentence or two. I thought 8B would be dumb as rocks, but it’s bravely tackled many tough questions, leveraging its modest knowledge base plus R1-distill CoT to punch above my expectations.

It’s absolutely incredible how far things have come in a year. I’m deeply appreciative, and glad to have a hobby that makes me feel a little excited.