What are GGUF models? What are model "quants"?

Layla
Aug 21, 2024

The inference engine behind Layla is "llama.cpp" (https://github.com/ggerganov/llama.cpp), a very popular open-source inference engine that allows running LLMs on mobile devices.

This inference engine supports a specific file format called GGUF. A GGUF file contains the AI model that powers all features in Layla.
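If you want to check that a file you downloaded really is a GGUF, here is a minimal sketch in Python. It relies on the fact that GGUF files start with the four bytes "GGUF" followed by a little-endian 32-bit version number. The function name `read_gguf_version` is made up for this example and is not part of Layla or llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the file is not a GGUF.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The version is stored as a little-endian unsigned 32-bit integer
    # immediately after the magic bytes.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

A return value of None means the file is something else (for example, a partially downloaded file or a different model format).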

The pre-built models that you downloaded when you first opened Layla are good, but the true power of Layla comes from being able to load any AI model you wish. This can include uncensored ones, professional ones, roleplay ones, or any others that have been created by the local AI community.

You can find all GGUF models that Layla supports here: https://huggingface.co/l3utterfly

How to load a custom GGUF into Layla

1. Choose a model that you like. In this example, we will use the popular Stheno-Mahou: https://huggingface.co/l3utterfly/llama-3-Stheno-Mahou-8B-gguf

2. Click the "Files and versions" tab

3. You will see a list of files (models) that you can download

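Each filename is annotated with a "quant" label of the form QXX, for example "Q2_K". "Quant" is short for quantisation: the model's weights are stored at reduced precision, and the number after "Q" is roughly the number of bits used per weight.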

You will notice that the higher the quant (the bigger the number after "Q"), the larger the file. Larger files give higher-quality responses from the AI, but they also need better hardware to run.
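To make the link between the quant number and the file size concrete, here is a rough sketch in Python. It pulls the number after "Q" out of a filename and multiplies it by the parameter count. The filenames in the usage note are hypothetical, and the estimate is only a lower bound under the assumption that every weight uses exactly that many bits; real files are typically somewhat larger.

```python
import re

# Matches a quant label such as Q2_K, Q4_0_4_4 or Q8_0 inside a filename.
# The "Q" must not be glued to a preceding letter or digit.
_QUANT_RE = re.compile(r"(?<![A-Za-z0-9])Q(\d+)(?=[_.]|$)", re.IGNORECASE)


def quant_bits(filename):
    """Return the number after "Q" in a filename, or None if there is none."""
    match = _QUANT_RE.search(filename)
    return int(match.group(1)) if match else None


def rough_size_gb(params_billion, bits):
    """Rough lower bound on file size: parameters x bits per weight / 8.

    Assumption for illustration: every weight uses exactly `bits` bits.
    """
    return params_billion * bits / 8
```

For example, `quant_bits("example-8B.Q4_K.gguf")` gives 4, and `rough_size_gb(8, 4)` suggests that an 8-billion-parameter model at Q4 needs at least about 4 GB, while the same model at Q8 needs at least about 8 GB.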

As a general rule of thumb, I suggest starting with the Q4_K model. If it feels fast enough, you can try going up to Q6 or Q8. If it feels too slow, go down to Q2.
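The same rule of thumb can be written as a tiny Python sketch: start in the middle of a ladder of quants and move one step up or down depending on how the model feels on your device. The ladder and the function name `next_quant` are assumptions made for this example; they are not something Layla exposes.

```python
# Quant levels from the rule of thumb, ordered from smallest/fastest
# to largest/highest quality. Illustrative only.
QUANT_LADDER = ["Q2", "Q4", "Q6", "Q8"]


def next_quant(current, fast_enough):
    """Suggest the next quant to try after testing `current`.

    Moves one step up the ladder if the model felt fast enough,
    one step down if it felt too slow, and stays put at either end.
    """
    index = QUANT_LADDER.index(current)
    step = 1 if fast_enough else -1
    new_index = min(max(index + step, 0), len(QUANT_LADDER) - 1)
    return QUANT_LADDER[new_index]
```

Starting from "Q4", a fast-enough result suggests "Q6" next, and a too-slow result suggests "Q2".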

There are three "special" quants: "Q4_0_4_4", "Q4_0_4_8", and "Q4_0_8_8". These are special models for the latest hardware. To learn more about them and whether you can use them, read here: https://www.layla-network.ai/post/layla-supports-i8mm-hardware-for-running-llm-models

4. Download the model by clicking the little download arrow
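If you would rather fetch the file from a computer with a script instead of clicking the arrow, here is a minimal sketch in Python. It assumes Hugging Face's direct-download URL pattern (`/resolve/main/` followed by the filename). The filename in the usage note is a placeholder, so copy the exact name from the "Files and versions" tab. Both function names are made up for this example.

```python
from urllib.parse import quote
from urllib.request import urlretrieve


def hf_download_url(repo_id, filename, revision="main"):
    """Build a direct download URL for one file in a Hugging Face repo."""
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


def download_model(repo_id, filename, dest):
    """Download one model file to `dest` and return the destination path.

    GGUF files are large, so this can take a long time.
    """
    urlretrieve(hf_download_url(repo_id, filename), dest)
    return dest
```

For example, `hf_download_url("l3utterfly/llama-3-Stheno-Mahou-8B-gguf", "example.Q4_K.gguf")` builds the link for a file with that (placeholder) name in the Stheno-Mahou repository used above.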