Use the vitals package with ellmer to evaluate and compare the accuracy of LLMs, including writing evals to test local models.
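For readers who want to try this, a minimal sketch of the pattern follows, using vitals' Task workflow with a local model served through ellmer's Ollama backend. The toy dataset, the "llama3.2" model name, and the choice of scorer are illustrative assumptions rather than details from the article, and exact argument names may vary across package versions.

    # Minimal vitals + ellmer eval sketch; the dataset and model name below
    # are illustrative assumptions, and argument names may differ by version.
    library(vitals)
    library(ellmer)
    library(tibble)

    # Tiny toy dataset: each row pairs an input prompt with a target answer.
    arithmetic <- tibble(
      input  = c("What's 2 + 2?", "What's 7 * 6?"),
      target = c("4", "42")
    )

    # Solver: a local model served by Ollama (assumes Ollama is running and
    # "llama3.2" has already been pulled; swap in any local model).
    local_chat <- chat_ollama(model = "llama3.2")

    tsk <- Task$new(
      dataset = arithmetic,
      solver  = generate(local_chat),
      scorer  = model_graded_qa(scorer_chat = local_chat)  # LLM-graded answers
    )

    # Run the eval; repeating this with different chat objects is how you
    # compare accuracy across models.
    tsk$eval()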
Users running a quantized 7B model on a laptop expect 40+ tokens per second. A 30B MoE model on a high-end mobile device ...
Nvidia researchers developed Dynamic Memory Sparsification (DMS), a technique that compresses the KV cache in large language models by up to 8x while maintaining reasoning accuracy — and it can be ...
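To put the 8x figure in perspective, a back-of-the-envelope KV cache sizing is sketched below; the model dimensions are illustrative assumptions for a 7B-class dense transformer, not numbers from the DMS work.

    # Rough KV cache sizing; dimensions are illustrative assumptions for a
    # 7B-class dense model (no grouped-query attention), not from the paper.
    n_layers   <- 32
    n_kv_heads <- 32
    head_dim   <- 128
    seq_len    <- 32768   # tokens of context
    bytes_elem <- 2       # fp16/bf16 storage

    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    kv_bytes <- 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_elem

    cat(sprintf("Uncompressed KV cache: %.1f GiB\n", kv_bytes / 2^30))      # 16.0
    cat(sprintf("At 8x compression:     %.1f GiB\n", kv_bytes / 8 / 2^30))  # 2.0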
Some believe that the firms behind generic AI ought to be forced to lean into customized LLMs that provide mental health support. Good idea or bad? An AI Insider analysis.