Designing evidence strength for an AI detector report
How sample length, sentence count, feature summaries, and explicit fallback states can keep a detector from overstating confidence.

An AI detector can return a precise-looking percentage from a weak sample. That is a product problem before it is a model problem.
A result such as "72% AI likelihood" looks self-contained, but it leaves out the conditions that make the number more or less useful: how much text was analyzed, how many sentences can be compared, whether the primary provider answered, and whether document extraction changed the material under review.
While building the report flow for Detector de IA, I found it more useful to model evidence strength separately from the detector score. The goal is not to turn heuristics into certainty. It is to prevent the interface from presenting every numerical result with the same authority.
Disclosure: I used AI assistance while drafting this article and reviewed it against the implementation before publishing.
A score and its reliability answer different questions
The score answers: what did the detector estimate for this input?
Reliability answers: how much text-only evidence was available to support a stable review?
Those values should not be collapsed into one field. In the report model, they are separate:
```ts
type DetectorReliability = {
  level: "high" | "medium" | "low";
  reasons: string[];
};

type DetectionReport = {
  scores: {
    aiGenerated: number;
    likelyHuman: number;
  };
  reliability: DetectorReliability;
  limitations: string[];
};
```
This distinction matters because a provider can return a high score for a short paragraph. The provider response may be valid as a response, while the sample is still too small for strong interpretation.
Separating the fields also improves the UI contract. The score can remain visible, but the report can place a low-reliability label and its reasons next to it. A user sees both the estimate and the reason not to overread it.
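To make that UI contract concrete, here is a minimal sketch of a renderer that always prints the reliability label and its reasons next to the score. The `renderScoreLines` helper and the assumption that scores are percentages from 0 to 100 are mine for illustration; the real report component may look quite different.

```ts
// Types repeated from the report model so the sketch is self-contained.
type DetectorReliability = {
  level: "high" | "medium" | "low";
  reasons: string[];
};

type DetectionReport = {
  scores: { aiGenerated: number; likelyHuman: number };
  reliability: DetectorReliability;
  limitations: string[];
};

// Hypothetical helper: returns the lines a report surface would show.
// Assumes scores are percentages in the range 0-100.
function renderScoreLines(report: DetectionReport): string[] {
  const lines: string[] = [
    `AI likelihood: ${Math.round(report.scores.aiGenerated)}%`,
    `Reliability: ${report.reliability.level}`,
  ];
  // Reasons follow the label so the score is never shown without context.
  for (const reason of report.reliability.reasons) {
    lines.push(`- ${reason}`);
  }
  return lines;
}
```

Because the function takes the whole report rather than a bare number, a caller cannot render the score without also having the reliability data in hand.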
Treat sample sufficiency as a product guardrail
The current reliability function uses character count and sentence count as transparent guardrails.
Fewer than 1,000 characters produces low reliability.
From 1,000 to fewer than 2,400 characters produces medium reliability.
Fewer than five sentences produces low reliability.
Fewer than ten sentences can reduce an otherwise high result to medium.
These are product thresholds, not universal scientific constants. They encode a practical principle: rhythm, repetition, and variation are harder to compare when the sample has very little internal structure.
That qualification is important. Calling the thresholds "accuracy calibration" would overstate what they do. They do not validate the detector. They only describe whether the report had enough material for a broader text-only comparison.
The minimum accepted input can also be lower than the threshold for stronger evidence. Detector de IA accepts text starting at 300 characters, but accepting an input does not mean presenting it as a strong sample. The workflow can still return a result while clearly downgrading its reliability.
This pattern applies outside AI detection. When an application can technically process an input that is too sparse for confident interpretation, represent input sufficiency as its own state.
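The thresholds in this section can be sketched as a small pure function. The numbers come from the list above; the function name, constant names, and the wording of the two "medium" reasons are illustrative assumptions rather than the production code.

```ts
type ReliabilityLevel = "high" | "medium" | "low";

type DetectorReliability = {
  level: ReliabilityLevel;
  reasons: string[];
};

// Product thresholds from the article (constant names are hypothetical).
const LOW_CHARS = 1000;
const MEDIUM_CHARS = 2400;
const LOW_SENTENCES = 5;
const MEDIUM_SENTENCES = 10;

const RANK: Record<ReliabilityLevel, number> = { low: 0, medium: 1, high: 2 };

// Returns whichever of the two levels is weaker.
function weaker(a: ReliabilityLevel, b: ReliabilityLevel): ReliabilityLevel {
  return RANK[a] <= RANK[b] ? a : b;
}

function assessSample(charCount: number, sentenceCount: number): DetectorReliability {
  let level: ReliabilityLevel = "high";
  const reasons: string[] = [];

  if (charCount < LOW_CHARS) {
    level = weaker(level, "low");
    reasons.push("Short sample: longer passages reduce unstable conclusions.");
  } else if (charCount < MEDIUM_CHARS) {
    level = weaker(level, "medium");
    reasons.push("Moderate sample length: a longer passage would support a more stable review.");
  }

  if (sentenceCount < LOW_SENTENCES) {
    level = weaker(level, "low");
    reasons.push("Too few sentences to compare rhythm, repetition, and variation.");
  } else if (sentenceCount < MEDIUM_SENTENCES) {
    level = weaker(level, "medium");
    reasons.push("Fewer than ten sentences limits comparison of rhythm, repetition, and variation.");
  }

  return { level, reasons };
}
```

Under this sketch, a 400-character input clears the 300-character acceptance minimum but still comes back as `low`, which is the intended behavior: accepted, scored, and visibly downgraded.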
Return reasons, not just a badge
A low label without an explanation becomes another unexplained score. The report therefore returns a list of reliability reasons.
For a short sample, the reason says that longer passages reduce unstable conclusions. For a low sentence count, it explains that there are too few sentences to compare rhythm, repetition, and variation.
Server-returned reasons are more useful than hard-coded UI copy in three ways:
The server owns the decision and the explanation together.
English and Spanish reports can receive localized reasons from the same branch.
Exported or printable reports retain the context even when they leave the application UI.
This also makes future changes easier to audit. If a threshold changes, the branch and its user-facing explanation change in the same place.
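One way to keep the decision and its explanation together is to have each branch emit a reason code and resolve the localized text in the same module. The sketch below assumes hypothetical reason codes and illustrative English and Spanish wording; it is not the product's actual copy.

```ts
type Locale = "en" | "es";
type ReasonCode = "short_sample" | "few_sentences";

// Illustrative wording only; the real report text may differ.
const REASON_TEXT: Record<ReasonCode, Record<Locale, string>> = {
  short_sample: {
    en: "The sample is short; longer passages reduce unstable conclusions.",
    es: "La muestra es corta; los pasajes más largos reducen las conclusiones inestables.",
  },
  few_sentences: {
    en: "There are too few sentences to compare rhythm, repetition, and variation.",
    es: "Hay muy pocas oraciones para comparar ritmo, repetición y variación.",
  },
};

// Each threshold branch pushes a code, so the decision and its explanation
// change in one place when a threshold changes.
function reliabilityReasons(
  charCount: number,
  sentenceCount: number,
  locale: Locale,
): string[] {
  const codes: ReasonCode[] = [];
  if (charCount < 1000) codes.push("short_sample");
  if (sentenceCount < 5) codes.push("few_sentences");
  return codes.map((code) => REASON_TEXT[code][locale]);
}
```

The `Record<ReasonCode, Record<Locale, string>>` type means the compiler rejects a new reason code that lacks a translation, which helps keep the two languages from drifting apart.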
Keep feature summaries descriptive
The text analysis builds a feature summary with fields such as:
character, sentence, and paragraph counts
average sentence length and sentence-length variance
short- and long-sentence ratios
repeated-segment ratio
punctuation variety
generic marker count and examples
These features provide context for analysis bullets and fallback behavior. They should not be described as proof that a person or a model wrote the text.
Many human documents contain repeated phrasing, templates, uniform sentence lengths, or generic transitions. Edited, translated, and paraphrased AI text can show the opposite. A feature is a review signal, not provenance.
That is why the report language uses phrases such as "can increase the appearance of AI-generated writing." The wording preserves the difference between observing a textual pattern and assigning authorship.
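As a rough illustration, a few of these features can be computed with plain string handling. Everything here is an assumption made for the sketch: the naive sentence splitter, measuring sentence length in words, the eight-word "short sentence" cutoff, and the variance threshold in `describeUniformity`. Only the hedged phrasing of the bullet follows the report language quoted above.

```ts
type FeatureSummary = {
  characterCount: number;
  sentenceCount: number;
  averageSentenceLength: number; // measured in words (an assumption)
  sentenceLengthVariance: number;
  shortSentenceRatio: number;
};

const SHORT_SENTENCE_WORDS = 8; // hypothetical cutoff

// Naive splitter: runs of text ending in ., !, or ?. Real text needs more care.
function splitSentences(text: string): string[] {
  const matches = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

function summarizeFeatures(text: string): FeatureSummary {
  const lengths = splitSentences(text).map(
    (s) => s.split(/\s+/).filter((w) => w.length > 0).length,
  );
  const n = lengths.length;
  const mean = n === 0 ? 0 : lengths.reduce((a, b) => a + b, 0) / n;
  // Population variance of sentence length.
  const variance =
    n === 0 ? 0 : lengths.reduce((acc, len) => acc + (len - mean) * (len - mean), 0) / n;
  const shortCount = lengths.filter((len) => len <= SHORT_SENTENCE_WORDS).length;
  return {
    characterCount: text.length,
    sentenceCount: n,
    averageSentenceLength: mean,
    sentenceLengthVariance: variance,
    shortSentenceRatio: n === 0 ? 0 : shortCount / n,
  };
}

// Returns a descriptive bullet, or null when there is nothing to say.
// The wording describes a pattern; it does not assign authorship.
function describeUniformity(summary: FeatureSummary): string | null {
  if (summary.sentenceCount >= 5 && summary.sentenceLengthVariance < 4) {
    return "Sentence lengths are fairly uniform, which can increase the appearance of AI-generated writing.";
  }
  return null;
}
```

Note that `describeUniformity` stays silent for very short inputs, since a variance computed over two or three sentences says little.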
Make fallback provenance visible
The report flow calls a primary detection service, with retry handling for retryable failures. If that service remains unavailable, the application creates a local fallback estimate from the text features.
The dangerous implementation would be to return that fallback through the same presentation path without saying where it came from. The user would see a percentage but not the material change in how it was produced.
Instead, the fallback source changes the report in two ways:
the summary explicitly says that a local fallback estimate was generated because the primary service was unavailable
reliability is downgraded and gains a reason describing that fallback
This is a general reliability rule for products that combine external providers with local recovery logic: graceful degradation should preserve provenance. A fallback can keep a workflow moving, but it must not impersonate the primary result.
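A minimal sketch of that rule follows. The `source` field, the injected `primary` and `localEstimate` functions, the one-step downgrade, and the reason wording are all assumptions for illustration. The sketch also retries on any error, whereas the real flow only retries failures it considers retryable.

```ts
type ReliabilityLevel = "high" | "medium" | "low";
type Reliability = { level: ReliabilityLevel; reasons: string[] };
type ScoreSource = "primary" | "local_fallback";

type SourcedReport = {
  source: ScoreSource;
  aiGenerated: number;
  summary: string;
  reliability: Reliability;
};

// Assumption: a fallback lowers reliability by one step.
function downgrade(level: ReliabilityLevel): ReliabilityLevel {
  return level === "high" ? "medium" : "low";
}

function buildFallbackReport(
  text: string,
  localEstimate: (text: string) => number,
  baseline: Reliability,
): SourcedReport {
  return {
    source: "local_fallback",
    aiGenerated: localEstimate(text),
    summary:
      "A local fallback estimate was generated because the primary service was unavailable.",
    reliability: {
      level: downgrade(baseline.level),
      // Copy the baseline reasons rather than mutating them.
      reasons: [
        ...baseline.reasons,
        "This estimate comes from local text features, not the primary detection service.",
      ],
    },
  };
}

async function detectWithFallback(
  text: string,
  primary: (text: string) => Promise<number>,
  localEstimate: (text: string) => number,
  baseline: Reliability,
  maxAttempts = 3,
): Promise<SourcedReport> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const aiGenerated = await primary(text);
      return {
        source: "primary",
        aiGenerated,
        summary: "Estimate from the primary detection service.",
        reliability: baseline,
      };
    } catch {
      // Simplification: every failure is treated as retryable.
    }
  }
  return buildFallbackReport(text, localEstimate, baseline);
}
```

Keeping `source` in the returned data, rather than in a log line, is what lets the summary and the reliability reasons change whenever the fallback path is taken.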
Document extraction adds another boundary
Pasted text and uploaded documents do not enter the pipeline in exactly the same way. PDF and DOCX files first need text extraction; TXT and Markdown files can be read more directly.
For document inputs, the report includes a limitation explaining that highlights are aligned against the extracted text shown in the report. This matters because a sentence highlight refers to the extracted representation, not necessarily to the document's original visual coordinates.
That limitation belongs in the report data, not only in help text. It travels with the result and remains visible when the user copies or exports the report.
Limitations are part of the response schema
The most important report fields are not always the scores. The response also states that:
the analysis is probabilistic and does not prove the actual provenance of the text
paraphrasing, translation, templates, and heavy editing can reduce detector accuracy
document highlights align against extracted text
These statements are not legal decoration. They define how the output should be used. False positives and false negatives remain possible, so the report should support manual review rather than become the sole basis for a high-impact decision.
Putting limitations in structured data gives every presentation surface the same baseline. The web report, copied summary, and printable export do not need to invent their own warning language.
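As a sketch of what "limitations in structured data" might look like, the builder below assembles the list once on the server side. The `InputKind` values and exact wording are illustrative assumptions, and the sketch assumes the extraction limitation is added for every uploaded document rather than only for PDF and DOCX.

```ts
type InputKind = "pasted_text" | "pdf" | "docx" | "txt" | "markdown";

// Limitations that apply to every report.
const BASE_LIMITATIONS: string[] = [
  "This analysis is probabilistic and does not prove the actual provenance of the text.",
  "Paraphrasing, translation, templates, and heavy editing can reduce detector accuracy.",
];

// Added only when the text came from an uploaded document.
const EXTRACTION_LIMITATION =
  "Highlights are aligned against the extracted text shown in this report.";

function buildLimitations(input: InputKind): string[] {
  const limitations = [...BASE_LIMITATIONS];
  if (input !== "pasted_text") {
    limitations.push(EXTRACTION_LIMITATION);
  }
  return limitations;
}

// Any surface (web, copied summary, print) can render the same list.
function limitationsAsPlainText(limitations: string[]): string {
  return limitations.map((item) => `* ${item}`).join("\n");
}
```

Each presentation surface then formats the same array instead of writing its own warnings.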
The reusable design pattern
The implementation can be summarized as five decisions:
Keep model scores separate from evidence reliability.
Derive sample-sufficiency reasons from observable input properties.
Describe textual features as signals, never proof of authorship.
Expose fallback source and downgrade reliability when it is used.
Return limitations as first-class report data.
None of these decisions makes an AI detector definitive. That is the point. The report becomes more useful by showing where its confidence should end.
I am applying this pattern in the free text and document review flow at Detector de IA.

