The AI input quality problem is the most expensive bug in production agent stacks. Bad prompts produce inconsistent output, the model cannot fix what you did not specify, and you pay for the inference anyway. Every team that ships agents has hit this, and every team has reached for some kind of evaluation tooling.
The catch is that no single framework captures prompt quality completely. PQS combines five. This post explains why, how the reconciliation works, and what it gives you that single-framework tools cannot.
No Single Framework Is Enough
There are five widely-used frameworks for evaluating prompt and LLM output quality. Each is right about something specific. Each is incomplete about everything else.
PEEM (Prompt Engineering Evaluation Metric) weights structural quality. It asks whether the prompt has a clear role, an explicit output format, defined constraints, and unambiguous task framing. PEEM is strong on detecting prompts that are too vague to produce consistent output. It is weak on detecting prompts that are well-structured but missing the context the model needs to answer correctly.
RAGAS comes out of the RAG evaluation literature. It weights grounding: whether the prompt provides the context, examples, or retrieved evidence the model needs to give a faithful answer. RAGAS is strong on detecting prompts that ask for facts the model cannot verify from the input. It is weak on prompts that have rich context but ambiguous instructions for what to do with it.
MT-Bench comes from LMSYS and weights multi-turn instruction following. It evaluates whether a prompt sets up a tractable task that an LLM can execute across a conversation. MT-Bench is strong on detecting prompts that work in isolation but break down inside an agent loop. It is weak on single-shot prompts that never enter a multi-turn context.
G-Eval uses an LLM as judge. It weights LLM-judged faithfulness, coherence, and fluency of the model's likely output for a given prompt. G-Eval is strong on detecting prompts that the model can technically answer but cannot answer well. It is weak when the judge model shares the same blind spots as the model under evaluation.
ROUGE weights n-gram surface overlap between expected and produced output. It is the canonical metric in summarization evaluation. ROUGE is strong on detecting prompts where the expected output is highly templated. It is weak on creative or open-ended prompts where surface overlap is not the right signal.
Each of these frameworks is published, peer-reviewed in some form, and in active use across the industry. None of them on its own is sufficient.
The 8-Dimension Rubric Is The Reconciliation Layer
PQS scores prompts across 8 dimensions: specificity, clarity, grounding, examples, output format, role, constraints, and verifiability. The rubric is the reconciliation layer where multiple frameworks contribute signal to a single dimensional score.
- Specificity pulls signal primarily from PEEM (does the prompt define a concrete task) and from RAGAS (does the prompt include the specific context the model needs).
- Clarity pulls from PEEM (structural clarity) and from G-Eval (does the model's likely output match what the prompt seems to be asking for, indicating the prompt is unambiguous).
- Grounding pulls from RAGAS (is context provided) and ROUGE (when expected output exists, does the prompt set up surface overlap with it).
- Examples pulls from MT-Bench (few-shot example presence is a strong predictor of multi-turn stability) and PEEM (are the examples structurally usable).
- Output format pulls from PEEM (is format specified) and ROUGE (does the specified format align with verifiable output structure).
- Role pulls from PEEM (is the role explicit) and G-Eval (does the role match what produces the highest-judged output).
- Constraints pulls from PEEM (are constraints stated) and MT-Bench (do the constraints survive multi-turn drift).
- Verifiability pulls from G-Eval (can the output be judged) and ROUGE (can the output be measured against a reference).
The composite PQS score is not a simple average of the five framework scores. It is a weighted synthesis that accounts for which frameworks are reliable signal in which dimensions. PEEM gets heavier weight in structural dimensions. RAGAS gets heavier weight in grounding dimensions. G-Eval gets heavier weight where LLM-judged quality is the best proxy. The weights are not uniform across the rubric because the frameworks themselves are not uniform in what they reliably measure.
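As a rough illustration of the weighting idea, the sketch below maps each dimension to the two frameworks that feed it and combines their signals with non-uniform weights. The framework pairings come from the rubric above. Everything else is an assumption made for this example: the weight values, the 0-100 scale, the function names, the renormalization when a framework has no signal, and the choice to average dimension scores into a composite. None of these are the production PQS values.

```python
from __future__ import annotations

# ASSUMPTION: every weight here is a made-up placeholder, not a real PQS weight.
# Only the framework pairings per dimension are taken from the rubric.
DIMENSION_WEIGHTS: dict[str, dict[str, float]] = {
    "specificity":   {"PEEM": 0.7, "RAGAS": 0.3},
    "clarity":       {"PEEM": 0.6, "G-Eval": 0.4},
    "grounding":     {"RAGAS": 0.8, "ROUGE": 0.2},
    "examples":      {"MT-Bench": 0.6, "PEEM": 0.4},
    "output_format": {"PEEM": 0.7, "ROUGE": 0.3},
    "role":          {"PEEM": 0.6, "G-Eval": 0.4},
    "constraints":   {"PEEM": 0.6, "MT-Bench": 0.4},
    "verifiability": {"G-Eval": 0.6, "ROUGE": 0.4},
}


def dimension_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of the framework signals available for one dimension.

    ASSUMPTION: a framework with no signal (for example ROUGE without a
    reference output) is skipped and the remaining weights are renormalized.
    """
    available = {name: w for name, w in weights.items() if name in signals}
    total = sum(available.values())
    if total == 0:
        return 0.0
    return sum(signals[name] * w for name, w in available.items()) / total


def composite(per_dimension: dict[str, dict[str, float]]) -> float:
    """ASSUMPTION: the composite is the plain mean of the 8 dimension scores."""
    scores = [
        dimension_score(per_dimension.get(dim, {}), weights)
        for dim, weights in DIMENSION_WEIGHTS.items()
    ]
    return sum(scores) / len(scores)
```

The point of the structure is that the same framework can carry a different weight in each dimension, which a flat average of five framework scores cannot express.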
Why /pqs-grade Shows Three Frameworks
The PQS bazaar endpoint returns three framework sub-scores: PEEM, RAGAS, G-Eval. Not five. This is a deliberate choice.
These three are the most-cited in production agent stacks as of 2026: PEEM because structural quality is the failure mode developers see most often, RAGAS because RAG is the dominant architecture for agent context, and G-Eval because LLM-as-judge has become the default for online evaluation in production. If you could see only three framework scores, these are the three that would tell you the most about your prompt.
MT-Bench and ROUGE are still part of the composite score. They are not surfaced in the bazaar endpoint response because they require more context to interpret correctly. MT-Bench needs a multi-turn use case to score meaningfully. ROUGE needs a reference output to compare against. Surfacing them in a single-shot $0.001 endpoint would either give noisy signal or require additional input fields that fight the simplicity of the bazaar primitive.
The full PQS SaaS at promptqualityscore.com returns all five framework scores along with the full 8-dimension breakdown, dimension-by-dimension diagnostics, and rewrite suggestions. The bazaar endpoint is the pre-flight check. The SaaS is the deep diagnostic.
Where The Difference Shows Up
Consider a prompt:
Write a poem about cats.
Score this with G-Eval alone and you get a moderate score. The model can produce a fluent poem about cats. The output is coherent. A G-Eval judge would rate it acceptable.
Score it with ROUGE alone and you get no usable signal, because there is no reference output to compare against.
Score it with PEEM alone and you get a low score. No role, no output format, no constraints, no length specified.
Score it with PQS and you get a composite 21 out of 100, grade F. The verdict reads: "RAGAS flags missing background. Supply the reader, purpose, or upstream context."
The PQS score is lower than any single framework would assign because PQS sees the prompt across all five framework lenses simultaneously. PEEM sees missing structure. RAGAS sees missing context. G-Eval sees acceptable output but cannot tell you whether the output is what the user actually wanted. The weakest dimension drives the verdict, and the composite reflects the truth that this prompt will produce inconsistent results in production regardless of how fluent any single output is.
A team scoring with G-Eval alone would ship this prompt and wonder why their output quality varies week to week. A team scoring with PQS sees the failure mode upstream, fixes it before deploying, and gets predictable output.
The point of combining frameworks is not to be exhaustive for its own sake. The point is that prompts fail in different ways, and each framework only catches some of those ways. A prompt that scores well on three frameworks but poorly on a fourth is not a good prompt. It is a prompt with one specific failure mode that the three good scores will hide from you if those are the only ones you check.
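To make the masking effect concrete, here is a minimal sketch of a check that looks at each framework score individually instead of at their average. The scores and the floor of 50 are hypothetical values chosen for the example. They are not PQS output or a PQS threshold.

```python
from __future__ import annotations

from statistics import mean


def failing_frameworks(scores: dict[str, float], floor: float = 50.0) -> list[str]:
    """Return the frameworks whose score falls below the floor, sorted by name.

    ASSUMPTION: the floor of 50 is an arbitrary illustrative cutoff.
    """
    return sorted(name for name, score in scores.items() if score < floor)


# Hypothetical scores: three frameworks look healthy, one does not.
scores = {"PEEM": 85.0, "RAGAS": 80.0, "G-Eval": 90.0, "MT-Bench": 25.0}

print(mean(scores.values()))       # 70.0, which looks acceptable on its own
print(failing_frameworks(scores))  # ['MT-Bench'], the failure the mean hides
```

An average of 70 would pass most gates, while the per-framework check surfaces the one lens under which the prompt breaks.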
PQS makes the failure mode visible.
Try It
Run /pqs-grade in the x402 bazaar for $0.001 USDC per check. No API key, no signup. Pay with your wallet. The response includes the composite score, grade, three framework sub-scores, and a verdict line.
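A client might model that response along the following lines. This is only a sketch: the field names (`composite`, `grade`, `frameworks`, `verdict`) and the payload shape are assumptions for illustration, not the documented schema, and the three sub-score values in the sample are placeholders. Only the composite, grade, and verdict text are taken from the example earlier in this post.

```python
from __future__ import annotations

from dataclasses import dataclass

# The three sub-scores the bazaar endpoint surfaces, per this post.
EXPECTED_FRAMEWORKS = ("PEEM", "RAGAS", "G-Eval")


@dataclass(frozen=True)
class GradeResult:
    # ASSUMPTION: hypothetical field names, not the real response schema.
    composite: int
    grade: str
    frameworks: dict[str, float]
    verdict: str


def parse_grade(payload: dict) -> GradeResult:
    """Validate a decoded response body and convert it to a GradeResult."""
    sub_scores = payload["frameworks"]
    missing = [name for name in EXPECTED_FRAMEWORKS if name not in sub_scores]
    if missing:
        raise ValueError(f"missing framework sub-scores: {missing}")
    return GradeResult(
        composite=int(payload["composite"]),
        grade=str(payload["grade"]),
        frameworks={name: float(sub_scores[name]) for name in EXPECTED_FRAMEWORKS},
        verdict=str(payload["verdict"]),
    )


# Sample payload. The sub-score numbers are placeholders, not real results.
sample = {
    "composite": 21,
    "grade": "F",
    "frameworks": {"PEEM": 0, "RAGAS": 0, "G-Eval": 0},
    "verdict": "RAGAS flags missing background. "
               "Supply the reader, purpose, or upstream context.",
}
result = parse_grade(sample)
```

Validating the three expected sub-scores up front means a caller fails loudly if the response shape differs from what this sketch assumes.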
Upgrade to PQS Pro for the full 8-dimension SaaS with all five framework scores, rewrite suggestions, and CI integration via the PQS GitHub Action.
The AI input quality problem is real. PQS is how you measure it.