# Statistical Reliability Measures

The MCP server returns statistical reliability measures with every behavioral query so you can judge how far to trust the results. Queries return 300 records by default for performance, so each response is a sample of the matching data, and sampling introduces variability that affects how trustworthy the results are.

### Response Structure

Every behavioral response includes `statistical_reliability`:

```json
{
  "success": true,
  "data": [...],
  "count": 300,
  "statistical_reliability": {
    "sampling_statistics": {
      "sampling_ratio": 0.15,
      "total_population": 2000,
      "sample_size": 300,
      "user_count": 12,
      "average_user_representation": 0.85,
      "representation_quality": "High"
    },
    "sample_adequacy": {
      "required_sample_size": 322,
      "actual_sample_size": 300,
      "adequacy_ratio": 0.93,
      "is_adequate": false,
      "reliability": "Low"
    },
    "confidence_interval": {
      "lower": 0.12,
      "upper": 0.18,
      "margin_of_error": 0.03,
      "confidence_level": 0.95
    }
  }
}
```
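As a minimal sketch of reading these fields on the client side, the snippet below parses a trimmed copy of the response above and extracts the values most useful for a quick check. The helper name `summarize_reliability` is hypothetical and not part of the server API.

```python
import json

# Trimmed copy of the example response; "data" is left empty for brevity.
raw = """
{
  "success": true,
  "data": [],
  "count": 300,
  "statistical_reliability": {
    "sampling_statistics": {"sampling_ratio": 0.15, "total_population": 2000,
                            "sample_size": 300, "representation_quality": "High"},
    "sample_adequacy": {"required_sample_size": 322, "actual_sample_size": 300,
                        "adequacy_ratio": 0.93, "is_adequate": false,
                        "reliability": "Low"},
    "confidence_interval": {"lower": 0.12, "upper": 0.18,
                            "margin_of_error": 0.03, "confidence_level": 0.95}
  }
}
"""


def summarize_reliability(response: dict) -> dict:
    """Collect the headline reliability fields (hypothetical helper)."""
    stats = response["statistical_reliability"]
    return {
        "representation_quality": stats["sampling_statistics"]["representation_quality"],
        "reliability": stats["sample_adequacy"]["reliability"],
        "is_adequate": stats["sample_adequacy"]["is_adequate"],
        "margin_of_error": stats["confidence_interval"]["margin_of_error"],
    }


summary = summarize_reliability(json.loads(raw))
print(summary)
```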

### Key Metrics

**Sampling Statistics**

* `sampling_ratio`: Fraction of the total population included in the sample (0.15 = 15%)
* `total_population`: Total records matching your query
* `representation_quality`: How well sample preserves user distribution ("High", "Medium", "Low")

**Sample Adequacy**

* `required_sample_size`: Minimum sample size needed to reach the target margin of error (5% by default) at 95% confidence
* `reliability`: Overall assessment ("High", "Adequate", "Low")

**Confidence Interval**

* `lower`/`upper`: Bounds of the range where the true population value likely falls
* `margin_of_error`: Half-width of the confidence interval (±0.03 = ±3 percentage points)

### Reliability Assessment

**High Reliability**: `representation_quality: "High"` + `reliability: "High"` + low margin of error

* Results are trustworthy and representative

**Medium Reliability**: Mixed indicators or `reliability: "Adequate"`

* Results are usable but note limitations in analysis

**Low Reliability**: `representation_quality: "Low"` or `reliability: "Low"` + high margin of error

* Avoid drawing conclusions; sample too small or biased
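One possible reading of these rules, sketched in Python. Two things here are assumptions rather than documented behavior: the 0.05 cutoff for a "low" margin of error, and the choice to treat a `"Low"` label on either field as sufficient for a low overall rating. The function name `overall_reliability` is hypothetical.

```python
def overall_reliability(representation_quality: str,
                        reliability: str,
                        margin_of_error: float,
                        max_margin: float = 0.05) -> str:
    """Combine the three indicators into one label (illustrative only)."""
    # Assumption: a "Low" label on either field is enough to distrust the sample.
    if representation_quality == "Low" or reliability == "Low":
        return "Low"
    # All three indicators must be good for a "High" rating.
    # Assumption: "low margin of error" means at most max_margin.
    if (representation_quality == "High"
            and reliability == "High"
            and margin_of_error <= max_margin):
        return "High"
    # Everything else counts as mixed indicators.
    return "Medium"


print(overall_reliability("High", "Low", 0.03))
```

Under this reading, the example response shown earlier rates as low reliability, because `reliability` is `"Low"` even though `representation_quality` is `"High"`.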

### Improving Sample Quality

When reliability is low, ask the MCP to:

* **Increase `max_results`**: Request 500-1000 records instead of 300
* **Broaden query parameters**: Expand date ranges or criteria
* **Check user diversity**: Ensure adequate representation across user types

Example: *"The sample reliability is low. Can you re-run this query with `max_results=800` to get better statistical confidence?"*

### Technical Implementation

#### Proportional Sampling Method

The server uses **stratified proportional sampling** by user to ensure representative results:

1. **User Distribution Analysis**: Calculate each user's proportion in the total population
2. **Quota Allocation**: Assign sample slots proportionally to maintain user representation
3. **Random Sampling**: Randomly select records within each user's quota
4. **Rounding Correction**: Distribute remaining slots to users with highest fractional quotas
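The four steps above can be sketched as follows. This is an illustration of the general technique (proportional allocation with largest-remainder rounding), not the server's actual code. The function name `proportional_sample`, the `user_id` key, and the `seed` parameter are assumptions.

```python
import math
import random
from collections import defaultdict


def proportional_sample(records, sample_size, user_key="user_id", seed=None):
    """Sample records so each user keeps roughly their population share."""
    if len(records) <= sample_size:
        return list(records)  # nothing to sample down

    rng = random.Random(seed)
    by_user = defaultdict(list)
    for record in records:
        by_user[record[user_key]].append(record)

    # Steps 1-2: fractional quota per user, rounded down to whole slots.
    total = len(records)
    exact = {u: sample_size * len(rs) / total for u, rs in by_user.items()}
    quota = {u: math.floor(q) for u, q in exact.items()}

    # Step 4: give leftover slots to the users with the largest fractional parts.
    leftover = sample_size - sum(quota.values())
    by_fraction = sorted(exact, key=lambda u: exact[u] - quota[u], reverse=True)
    for user in by_fraction[:leftover]:
        quota[user] += 1

    # Step 3: random selection within each user's quota.
    sample = []
    for user, user_records in by_user.items():
        sample.extend(rng.sample(user_records, quota[user]))
    return sample


# Example: 100 records from three users, sampled down to 7.
records = (
    [{"user_id": "a"}] * 50 + [{"user_id": "b"}] * 30 + [{"user_id": "c"}] * 20
)
sample = proportional_sample(records, 7, seed=0)
```

In this example the fractional quotas are 3.5, 2.1, and 1.4, so the one leftover slot goes to user `a`, which has the largest fractional part.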

#### Statistical Calculations

**Sample Adequacy Formula**:

```
n0 = (Z² × 0.25) / (margin_of_error²)
Required Sample Size (with finite population correction) = n0 / (1 + (n0 - 1) / N)

Where:
- Z = 1.96 (95% confidence level)
- 0.25 = p × (1 - p) at p = 0.5, the most conservative case
- margin_of_error = 0.05 (5% default)
- N = population size
```
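A short Python sketch of this formula, useful for checking a response by hand. The function name is hypothetical, and rounding to the nearest integer is an assumption. It does reproduce the `required_sample_size` of 322 shown in the example response for a population of 2000.

```python
def required_sample_size(population_size: int,
                         margin_of_error: float = 0.05,
                         z: float = 1.96) -> int:
    """Required sample size with finite population correction."""
    # Uncorrected size; 0.25 is p * (1 - p) at p = 0.5, the worst case.
    uncorrected = (z ** 2 * 0.25) / margin_of_error ** 2
    # Finite population correction shrinks the requirement for small populations.
    corrected = uncorrected / (1 + (uncorrected - 1) / population_size)
    return round(corrected)  # assumption: nearest-integer rounding


required = required_sample_size(2000)
adequacy_ratio = round(300 / required, 2)
print(required, adequacy_ratio)
```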

**Confidence Interval Calculation**:

```
p = sample_size / population_size
Standard Error = √(p × (1-p) / population_size)
Margin of Error = Z × Standard Error
CI = [p - margin_of_error, p + margin_of_error]
```
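The calculation above, transcribed directly into Python as a sketch. The function name and the returned dictionary keys are chosen to mirror the response fields and are otherwise assumptions. The numbers in the example response earlier on this page are illustrative and may not match this sketch exactly.

```python
import math


def confidence_interval(sample_size: int,
                        population_size: int,
                        z: float = 1.96) -> dict:
    """Confidence interval following the formula as documented above."""
    p = sample_size / population_size
    standard_error = math.sqrt(p * (1 - p) / population_size)
    margin_of_error = z * standard_error
    return {
        "lower": p - margin_of_error,
        "upper": p + margin_of_error,
        "margin_of_error": margin_of_error,
    }


ci = confidence_interval(300, 2000)
print(ci)
```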

**Representation Quality**:

* Measures how closely sample user distribution matches population
* Calculated as average of per-user representation scores
* Per-user score = `min(actual_ratio / expected_ratio, expected_ratio / actual_ratio)`
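A minimal sketch of the averaged score, assuming `actual_ratio` is a user's share of the sample and `expected_ratio` is that user's share of the population. Scoring a user who is missing from the sample as 0 is also an assumption, as is the function name.

```python
from collections import Counter


def representation_score(population_users, sample_users) -> float:
    """Average per-user match between sample share and population share."""
    pop_counts = Counter(population_users)
    sample_counts = Counter(sample_users)
    scores = []
    for user, pop_count in pop_counts.items():
        expected = pop_count / len(population_users)
        actual = sample_counts.get(user, 0) / len(sample_users)
        if actual == 0:
            scores.append(0.0)  # assumption: an absent user scores zero
        else:
            # 1.0 means a perfect match; over- and under-sampling both lower it.
            scores.append(min(actual / expected, expected / actual))
    return sum(scores) / len(scores)


population = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
perfect = representation_score(population, ["a"] * 5 + ["b"] * 3 + ["c"] * 2)
skewed = representation_score(population, ["a"] * 4 + ["b"] * 2 + ["c"] * 1)
print(perfect, skewed)
```

A sample that mirrors the population exactly scores 1.0, and the score drops as any user becomes over- or under-represented.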

### YouTube Data Note

YouTube responses include `deduplication_info` showing how data was cleaned for unique user-video combinations before sampling.

### Best Practice

Always check `reliability` and `margin_of_error` before analyzing results. When in doubt, request larger samples for more confident insights.

