LLM Models and API Integration

Summary

Model selection and API integration discussions reveal a pragmatic community approach: use whatever works best for the task. The conversation evolved from “GPT-4 or nothing” debates to nuanced understanding of model capabilities, inference backends, and cost-performance trade-offs. Mixtral emerged as the community favorite for game engines, while smaller models were tested extensively for specific subtasks. OpenRouter became the de facto cloud API, while TextGenWebUI, KoboldCPP, and TabbyAPI dominated local inference.

Key Concepts

Local vs Cloud Trade-offs

  • Local: Privacy, no API costs, offline capability, full control
  • Cloud: Better quality, faster setup, no hardware requirements, pay-per-use

Multi-Model Workflows

Different tasks need different models: logic vs creativity vs narration vs instruction-following.

Inference Backends

Software that runs LLMs locally: TextGenWebUI, KoboldCPP, TabbyAPI, LM Studio, Aphrodite.

Model Formats

  • GGUF: Quantized models for llama.cpp/KoboldCPP (CPU-friendly)
  • EXL2: Quantized for ExLlamaV2/TabbyAPI (GPU-optimized)
  • SafeTensors: Full precision models

Evolution of Ideas

Phase 1: “Just Use GPT-4” Debates (January 2024)

giftedgummybee [07:08]: “Have you tried with gpt-4 firstly”

50h100a [07:09]: “oh my god”

50h100a [07:13]: “no gpt-4, no finetunes, you can’t ‘just train a model to do exactly what you want all the time’”

giftedgummybee [07:09]: “doesn’t matter, you can distill the skill into a local model later. You need to know if its even possible to work first on a strong model”

Early Tension

“Prove it works with GPT-4 first” vs “If it needs GPT-4, it’s not a real solution”. This tension drove the community to make local models work.

50h100a [06:35]: “if you need chatgpt to make it work you’ve already lost :kekw:“

Phase 2: Local Model Limitations (January 2024)

50h100a [07:25]: “local models can handle limited statblocks. the bigger issue with them is that they pollute your prompt horribly”

giftedgummybee [07:24]: “Not even mixtral?”

Context: Mixtral-8x7B-Instruct was the first local model considered “good enough” for game logic.

vali98 [07:32]: “by heavy i mean slow” Context: Inference speed concerns for local models

Phase 3: Inference Backend Discussions (January 2024)

50h100a [07:40]: “…im a maintainer of aphrodite”

vali98 [07:40]: “koboldcpp / llamacpp would be preferable because gguf is more accessible”

50h100a [07:40]: “ill fight you”

Backend Wars

Good-natured rivalry between inference backend developers. KoboldCPP/llama.cpp (GGUF format) vs Aphrodite (GPTQ/EXL2 format). GGUF won for accessibility.

Phase 4: Grammar Constraints for Small Models (January 2024)

vali98 [15:28]: “Look at this, using this Grammar preset in koboldcpp: root::= '[YES]' | '[NO]' I can force the answer”

vali98 [15:31]: “https://github.com/ggerganov/llama.cpp/tree/master/grammars Essentially filters out text generated to fit”

monkeyrithms [15:33]: “wow, that is very useful. Ill look into this.”

Constraint-Based Generation

Grammars in llama.cpp/KoboldCPP allow forcing specific output formats, enabling smaller models to handle structured tasks.

Phase 5: TinyLlama Experiments (January 2024)

vali98 [15:22]: “yeah I think that tracks, i tried your example on a smaller model like TinyLlama and it seems to kinda work”

vali98 [15:38]: “just FYI this is TinyLlama with grammar, it seems to be somewhat ok at reasoning”

giftedgummybee [15:33]: “tinyllama has some… issues”

giftedgummybee [15:57]: “im quite sure tinyllama has issues”

monkeyrithms [15:57]: “thats the issue i get with small models”

Small Model Reality

TinyLlama (1.1B parameters) struggles with consistency despite occasional success. Not reliable enough for production game engines.

Phase 6: Mixtral as Sweet Spot (January 2024)

monkeyrithms [16:21]: “right now I’m using Mixtral for everything”

monkeyrithms [16:29]: “there is no aspect of it that requires GPT-4 or Goliath or stuff like that, so that’s encouraging. Mixtral is kind of perfect for it”

50h100a [08:05]: “then it should work via goliath” Context: Goliath-120B was considered premium local model, but unnecessary

Community Consensus

Mixtral-8x7B-Instruct emerged as the goldilocks model: Good enough for game logic, cheap on OpenRouter, runnable locally on consumer hardware, doesn’t need GPT-4 quality.

Phase 7: OpenRouter Integration (January 2024)

monkeyrithms [19:41]: “I accidentally sent it with my openRouter API (where I’ve found a ton of success running this game with the model ‘Mixtral’)”

monkeyrithms [19:44]: Multi-model configuration:

if url_type == "url1": #Use this for following basic instructions
    base_url = "https://openrouter.ai/api/v1"
    api_key = "put your API here"
    model_choice = "mistralai/mixtral-8x7b-instruct"

monkeyrithms [19:44]: “It tries to split different types of workload to different models, which allows you to specialize with models and/or save on inference costs (for instance, the only 2 we need a somewhat decent model for are the last 2)”

monkeyrithms [19:44]: “Or you can just use the same one for all of them — if it’s both cheap and good with logic, like Mixtral is”

OpenRouter Pattern

OpenRouter.ai became the standard cloud API: unified interface, multiple models, pay-as-you-go pricing, Mixtral for $0.27/1M tokens (at that time).

Phase 8: Local Inference Setup Challenges (January 2024)

monkeyrithms [19:41]: “Currently I only know how to support local inference (like textgenwebui is what I use) and OpenRouter. I haven’t figured out Horde”

hermokratesthelate [19:41]: “TabbyAPI would be a good one to add as well.”

monkeyrithms [19:42]: “In theory, as long as it uses the OpenAI API style, it should work if you have the URL and everything.”

hermokratesthelate [19:56]: “Damn. I can’t get it to connect to TabbyAPI. Eventhough it connects to http://127.0.0.1:5000

monkeyrithms [20:31]: “that’s strange.. im a bit new to all this, so getting the LLM set up in the first place, with just textgenwebui or the OpenAI API, was the biggest headache ever because i was so new to it”

Local Setup Friction

Local inference setup is a major barrier. Connection issues, port conflicts, API compatibility problems are common. OpenAI-compatible API helped but not perfect.

hermokratesthelate [19:59]: “Yeah. Like Ooba but specifically for exl2 files.” Context: TabbyAPI = TextGenWebUI alternative for EXL2 format models

monkeyrithms [20:37]: “for ooba, make sure the OpenAI box is checked, if it isn’t”

monkeyrithms [20:44]: “this is not the right API box to check. uncheck that one (I think), and check the box to the left (cut off screen) that says ‘OpenAI’. Then,. click ‘Apply Flags‘“

Phase 9: Sonja 7B Testing (January 2024)

monkeyrithms [20:09]: “I just haven’t found any models that I can locally run consistently handle the ‘advanced’ instructions/quest narratives without hallucinations or false positives, that is a work-in-progress (maybe you could help me 😄 ). Sonja 7b is one I tested lately that is surprisingly good at most things, though”

monkeyrithms [07:58]: “Sonja 7B is pretty good at these” [Screenshot showing successful character dialogue]

monkeyrithms [21:14]: “I just tried to run the quest on Sonja 7b again, and it just wildly hallucinated the answers to the quest questions with reckless abandon”

Model Inconsistency

7B models like Sonja-7B work well for dialogue and simple tasks but fail unpredictably on structured logic tasks (quest progression, instruction following).

Phase 10: Model Task Specialization (January 2024)

monkeyrithms [19:45]: “The reason I found out I needed a smarter model for narrators (which usually play as the setting) is because they handle quest progression, and if smaller models are frequently messing that up, it could probably become a very frustrating experience for the player”

monkeyrithms [20:46]: “I just tested the quest and this time, Mixtral actually bombed it.. 🤔 pretty rare, but it happened… it might be a bit of a hunt finding a smaller model that handles this as well as they can handle some other things.”

monkeyrithms [context]: Task-specific model recommendations:

  • Logic/basic instructions: Mixtral, GPT-3.5 (cheap, reliable)
  • Creative dialogue: Mixtral, RP-tuned models (character consistency)
  • Advanced instructions: Mixtral, GPT-3.5 (quest progression)
  • Narration: Mixtral, creative models (descriptive writing)

Multi-Model Strategy

Don’t use one model for everything. Assign fast/cheap models to simple tasks, reserve better models for critical paths (quest logic).

Phase 11: GPT-3.5 as Reliable Fallback (January 2024)

monkeyrithms [20:11]: “Right now I’ve been mostly testing it off openRouter.ai, where I call inference from Mixtral Instruct (the original one — its cheap there) or GPT 3.5”

monkeyrithms [20:46]: “but gpt 3.5 is solid, its never bombed a quest task yet”

GPT-3.5 Pattern

GPT-3.5-Turbo emerged as the “it just works” option: reliable for quest logic, cheap, fast, available everywhere. Use for critical paths when Mixtral fails.

Phase 12: Instruct vs RP Models (January 2024)

hermokratesthelate [20:10]: “What is your setup? I am trying to run it with a Mixtral RP model, and it will not connect to TabbyAPI for me.”

monkeyrithms [20:11]: “Right now I’ve been mostly testing it off openRouter.ai, where I call inference from Mixtral Instruct (the original one — its cheap there)”

Model Type Matters

Instruct models (Mixtral-Instruct) follow instructions better for game logic. RP models (roleplay-tuned) are better for character dialogue but worse at structured tasks.

Phase 13: Inference Speed Considerations (February 2024)

veritasr [06:17]: “That way it updates all the information based on user input correctly, then figures out what fluff it needs to add into the prompt for instruct generation, and you theoretically could get away with a single prompt generation. So.. about as fast as normal chat.”

Context: Architecture to minimize LLM calls for speed

Phase 14: LM Studio Support (January 2024)

monkeyrithms [20:21]: “Oh — I’ve also been able to get LM Studio working with it”

LM Studio

LM Studio = User-friendly GUI for local inference. Works with ChatBot RPG via OpenAI-compatible API.

Technical Patterns

1. Multi-Model Workflow Architecture

User Action
  ↓
Determine Task Type
  ↓
┌─────────────────────────┐
│ Simple Logic            │ → GPT-3.5 / Mixtral (fast endpoint)
│ Character Dialogue      │ → Mixtral / RP model (creative)
│ Quest Progression       │ → GPT-3.5 (reliable)
│ Narration               │ → Mixtral / Creative model
│ State Extraction        │ → GPT-3.5 / Mixtral (structured)
└─────────────────────────┘
  ↓
Return Result

2. API Abstraction Layer

class ModelRouter:
    def __init__(self):
        self.models = {
            "fast": {  # For simple yes/no, quick logic
                "provider": "openrouter",
                "model": "mistralai/mixtral-8x7b-instruct",
                "cost": "low"
            },
            "creative": {  # For dialogue, narration
                "provider": "openrouter",
                "model": "mistralai/mixtral-8x7b-instruct",
                "cost": "low"
            },
            "reliable": {  # For quest logic, critical paths
                "provider": "openrouter",
                "model": "openai/gpt-3.5-turbo",
                "cost": "medium"
            },
            "local": {  # For offline/privacy
                "provider": "local",
                "base_url": "http://127.0.0.1:5000/v1",
                "model": "local-mixtral"
            }
        }
 
    def route(self, task_type):
        return self.models.get(task_type, self.models["fast"])

3. OpenAI-Compatible Client

from openai import OpenAI
 
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # or local URL
    api_key="your-api-key"  # or "null" for local
)
 
response = client.chat.completions.create(
    model="mistralai/mixtral-8x7b-instruct",
    messages=[{"role": "user", "content": prompt}]
)

Works with:

  • OpenRouter (cloud)
  • TextGenWebUI (local, with —api flag)
  • TabbyAPI (local)
  • KoboldCPP (local, with —usecublas)
  • LM Studio (local)
  • Ollama (local, via compatibility layer)

4. Grammar-Constrained Generation (Local Only)

# KoboldCPP / llama.cpp grammar
grammar = """
root ::= "[YES]" | "[NO]"
"""
 
# Forces output to be exactly [YES] or [NO]
response = generate_with_grammar(prompt, grammar)

Enables small models (7B) to handle structured outputs reliably.

Model Comparison

ModelParametersUse CaseProsConsCost (OpenRouter)
Mixtral-8x7B-Instruct47BGeneral game logic, narrationCheap, reliable, runnable locallyNot creative enough for some RP~$0.27/1M tokens
GPT-3.5-TurboUnknownCritical quest logic, reliable tasksNever fails, fast, consistentCosts more, requires internet~$0.50/1M tokens (est)
GPT-4UnknownNot neededBest qualityToo expensive, overkill~$30/1M tokens
Sonja-7B7BSimple dialogue, testingSmall, fastInconsistent, hallucinates on logicFree (local)
TinyLlama1.1BExperiments onlyTiny, very fastToo unreliableFree (local)
Goliath-120B120BNot neededVery capableToo slow, expensive, overkillN/A

Local Inference Hardware Requirements

For Mixtral-8x7B (GGUF Q4):

  • Minimum: 24GB VRAM (GPU) or 32GB RAM (CPU-only, slow)
  • Recommended: 2x 3090 (24GB each) or Apple M2 Max/Ultra
  • Speed: ~5-15 tokens/sec (GPU), 1-3 t/s (CPU)

For 7B models (Sonja, etc.) (GGUF Q4):

  • Minimum: 8GB VRAM or 16GB RAM
  • Recommended: Single 3060 (12GB) or better
  • Speed: ~20-40 t/s (GPU), 5-10 t/s (CPU)

Inference Backend Comparison

BackendFormatPlatformEase of UsePerformanceNotes
TextGenWebUIManyWindows/LinuxMediumGoodMost popular, feature-rich, OpenAI API mode
KoboldCPPGGUFAll platformsEasyExcellent (CPU)Standalone, grammar support, no Python needed
TabbyAPIEXL2Linux/WindowsHardExcellent (GPU)Fast, optimized, requires technical setup
LM StudioGGUFAll platformsVery easyGoodGUI, beginner-friendly, limited features
AphroditeGPTQ/EXL2LinuxHardExcellent (GPU)High performance, server-focused
OllamaGGUFAll platformsEasyGoodSimple CLI, growing ecosystem

Design Principles

  1. Task-Appropriate Models: Don’t use premium models for simple tasks
  2. Fallback Strategy: Have backup model if primary fails
  3. OpenAI-Compatible APIs: Standardize on OpenAI client for portability
  4. Local + Cloud Hybrid: Use cloud for development, local for production/privacy
  5. Instruct for Logic: Use instruct-tuned models for game mechanics
  6. RP for Dialogue: Use RP-tuned models for character interactions (if separating)
  7. Grammar for Small Models: Constrain output format to improve reliability
  8. Monitor Costs: Track API usage, optimize prompt sizes
  9. Model Independence: Abstract model selection, easy to swap
  10. Test Locally: Ensure local models work before relying on cloud

Implementation Considerations

API Key Management

  • Store in environment variables or config files (not in code)
  • Support multiple API keys (OpenRouter, OpenAI, local)
  • UI for easy configuration
  • Validate keys on startup

Error Handling

  • Retry logic for API failures
  • Fallback to different model if primary fails
  • Timeout handling (long inference)
  • Rate limit handling (429 errors)
  • Graceful degradation (return cached/default content)

Cost Optimization

  • Cache repeated prompts
  • Minimize prompt size (don’t send full chat history)
  • Use cheaper models for non-critical tasks
  • Batch requests when possible
  • Monitor spending with OpenRouter dashboard

Local Inference Optimization

  • Quantization: Q4_K_M GGUF for balance of size/quality
  • Context size: Match to model capability (8K for Mixtral)
  • GPU layers: Max out VRAM without OOM
  • Batch size: Tune for hardware
  • Flash attention: Enable if supported

Common Pitfalls

Model Integration Antipatterns

  1. Using GPT-4 as baseline: It’s overkill and expensive. Start with Mixtral/GPT-3.5.
  2. Single model for everything: Different tasks need different capabilities.
  3. Not testing local models: Assuming they won’t work because “they’re small”.
  4. Hardcoding model names: Abstract into config so users can change.
  5. Ignoring grammar constraints: They make small models viable for structured tasks.
  6. No fallback strategy: If primary model is down, application breaks.
  7. Not validating outputs: Models hallucinate, always validate structured responses.
  8. Trusting RP models for logic: They’re trained for creativity, not instruction-following.

Tools and Services

Cloud APIs

  • OpenRouter: Multi-model aggregator, pay-per-use, $5 minimum
  • OpenAI: Direct access to GPT-3.5/GPT-4
  • AI Horde: Free (donation-based) distributed inference
  • Together.ai: Cheap Mixtral hosting
  • Groq: Very fast inference for select models

Local Inference

Model Sources

  • Hugging Face: Main source for open models
  • TheBloke: Quantized versions of popular models (GGUF, EXL2)
  • OpenRouter: Access many models via single API

Key Insights

  1. Mixtral-8x7B-Instruct is the goldilocks model for LLM game engines (~2024-2025)
  2. GPT-3.5 is the reliable fallback when logic must not fail
  3. Small models (7B) are inconsistent despite occasional success
  4. Multi-model workflows save money without sacrificing quality
  5. OpenRouter democratized model access - unified API, pay-per-use
  6. Local setup is still painful - connection issues, API compatibility, configuration
  7. Grammar constraints make small models viable for structured outputs
  8. Instruct models ≠ RP models - use the right tool for the task
  9. You don’t need GPT-4 for game engines
  10. OpenAI-compatible APIs won - everyone standardized on that interface

Future Directions

  • Smaller models getting better (8B models approaching old 70B quality)
  • Local inference getting faster (better quantization, hardware)
  • More specialized models (fine-tuned for game logic, dialogue, etc.)
  • Function calling becoming standard (structured outputs)
  • Longer context windows (32K+) enabling more game state in prompts
  • More OpenAI-compatible providers
  • Better error handling and retry logic
  • Streaming responses becoming standard
  • Usage tracking and cost management tools

Open Questions

Unresolved Issues

  1. What’s the optimal model size for game engines? (7B too small, 70B overkill, 8x7B just right?)
  2. When will RP-tuned instruct models be viable? (Good at both logic AND creativity)
  3. How to dynamically select models based on task complexity?
  4. Will function calling replace manual state extraction?
  5. Can grammar constraints fully replace larger models for structured tasks?
  6. What’s the future of distributed inference (AI Horde model)?

Timeline

  • January 2024: GPT-4 vs local debates, Mixtral identified as viable
  • January 2024: Grammar constraints discovered for small models
  • January 2024: TinyLlama experiments (mostly failed)
  • January 2024: OpenRouter adopted as standard cloud API
  • January 2024: TextGenWebUI, KoboldCPP, TabbyAPI integration efforts
  • January 2024: Sonja-7B testing (inconsistent results)
  • January 2024: Multi-model workflow patterns established
  • January 2024: GPT-3.5 emerges as reliable fallback
  • 2024-2025: Mixtral dominance continues for game engines

Pattern Library

Prompt Library


Core Achievement

The community successfully identified practical model choices for LLM game engines without requiring GPT-4. Mixtral-8x7B-Instruct + GPT-3.5 fallback became the standard pattern. OpenRouter’s unified API made multi-model workflows accessible. Local inference works but with setup friction. Grammar constraints enable small models for structured tasks.