Tools
Poltergeist
Poltergeist is Ghost Security Agent's secret scanner. It scans source code for leaked API keys, tokens, certificates, and credentials using a dual-engine architecture that combines speed with precision.
Architecture
Dual-engine design
Poltergeist uses two regex engines and selects the best one automatically:
Hyperscan engine -- a high-performance multi-pattern matcher. It evaluates all rules simultaneously in a single pass over the file content, maintaining consistent scan times regardless of rule count. With 100 rules, Hyperscan scans the Linux kernel (1.4 GB) in about 8 seconds.
Go regex engine -- a fallback engine for environments where Hyperscan isn't available, and the default for single-pattern scans. Performance scales linearly with rule count.
In auto mode (the default), Poltergeist uses Hyperscan for multi-pattern scans when available, and Go regex for single patterns or when Hyperscan isn't installed.
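The selection rule above can be sketched in a few lines. This is only an illustration of the described behavior, not Poltergeist's actual code: the function name select_engine and its parameters are hypothetical, and raising an error when a forced Hyperscan engine is unavailable is an assumption.

```python
def select_engine(mode: str, pattern_count: int, hyperscan_available: bool) -> str:
    """Return "hyperscan" or "go", mirroring the auto-mode rule described above."""
    if mode == "go":
        return "go"
    if mode == "hyperscan":
        # Assumption: forcing an engine that is not installed is treated as an error.
        if not hyperscan_available:
            raise RuntimeError("hyperscan engine requested but not available")
        return "hyperscan"
    if mode != "auto":
        raise ValueError(f"unknown engine mode: {mode!r}")
    # auto: Hyperscan for multi-pattern scans when available, Go regex otherwise.
    if hyperscan_available and pattern_count > 1:
        return "hyperscan"
    return "go"
```

With 100 rules and Hyperscan installed this returns "hyperscan"; with a single pattern, or without Hyperscan, it returns "go".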
Entropy analysis
Every match is evaluated for Shannon entropy (a measure of randomness). Each rule defines a minimum entropy threshold tuned to its specific pattern. Matches below the threshold are filtered out by default.
For example:
- A generic password variable (ghost.generic.3) has a threshold of 3.5 bits, because passwords can be relatively short
- An AWS session token (ghost.aws.2) has a threshold of 5.5 bits, because these tokens are long base64 strings with high randomness
- An OpenAI API key (ghost.openai.1) has a threshold of 5.1 bits
The -low-entropy flag shows matches below threshold, useful for debugging rules or investigating potential issues.
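For readers unfamiliar with Shannon entropy, the filter described in this section can be approximated as follows. This is a minimal sketch that assumes entropy is computed per character over the matched string; the helper names are hypothetical and Poltergeist's own calculation may differ in detail.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def meets_threshold(match: str, threshold: float) -> bool:
    """True when the match would be kept by a rule with this entropy threshold."""
    return shannon_entropy(match) >= threshold
```

A string of one repeated character has entropy 0, while a string of four distinct characters has entropy 2 bits, so longer and more random secrets clear higher thresholds.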
Automatic redaction
Poltergeist redacts secrets in its output by default. Redaction offsets are specified in individual rules and vary by secret type:
sk-proj-0JdlOY****hDvSYA (OpenAI key)
Bu/9****KBBJ (AWS secret)
Scan output is safe to share, log, or include in reports without exposing actual credential values.
Key features
- 100 built-in rules covering 50+ services across cloud providers, AI services, git platforms, CI/CD, communication tools, databases, payment processors, and more
- Dual-engine architecture with automatic engine selection
- Entropy filtering to reduce false positives from low-randomness matches
- Automatic redaction of secrets in all output formats
- Multiple output formats -- text (colored), JSON (machine-readable), Markdown (reports)
- Custom rules -- extend with your own YAML rule files
- Binary file detection -- automatically skips binary files, archives, and media
- Embedded rules -- rules compiled into the binary, no external files needed at runtime
CLI reference
Usage
poltergeist [options] <path> [pattern1] [pattern2] ...
Scans the file or directory at <path>. Optionally provide one or more regex patterns to match in addition to (or instead of) built-in rules.
Flags
| Flag | Default | Description |
|---|---|---|
| -engine | auto | Pattern matching engine: auto, go, or hyperscan |
| -rules | -- | Path to YAML rule file or directory of rule files |
| -format | text | Output format: text, json, or md |
| -output | -- | Write output to file (auto-detects format from .json or .md extension) |
| -dnr | false | Do not redact: show full secret values |
| -low-entropy | false | Show matches below entropy threshold |
| -no-color | false | Disable colored text output |
| -version | -- | Show version information |
| -help | -- | Show usage information |
Examples
# Scan a directory with default rules
poltergeist /path/to/code
# JSON output for CI/CD integration
poltergeist -format json /path/to/code
# Markdown report to file
poltergeist -output report.md /path/to/code
# Custom rules
poltergeist -rules ./my-rules.yaml /path/to/code
# Force Hyperscan engine
poltergeist -engine hyperscan /path/to/code
# Show low-entropy matches for investigation
poltergeist -low-entropy /path/to/code
# Combine custom rules with inline patterns
poltergeist -rules ./rules /path/to/code "api[_-]?key\s*[:=]\s*['\"]([^'\"]+)"
Output formats
Text (default) -- colored, human-readable output grouped by file:
SCAN SUMMARY
Files scanned: 1,247
Total content: 48 MB
Secrets found: 3
src/config/api.ts
Line 15: OpenAI API Key
sk-proj-0JdlOY****hDvSYA
ID: ghost.openai.1
Entropy: 5.2 | Threshold: 5.1 | Met: Yes
Duration: 0.8s
JSON -- structured output for programmatic consumption:
{
"summary": {
"files_scanned": 1247,
"files_skipped": 23,
"total_bytes": 50331648,
"matches_found": 3,
"high_entropy_matches": 3,
"low_entropy_matches": 0
},
"results": [
{
"file_path": "src/config/api.ts",
"line_number": 15,
"redacted": "sk-proj-0JdlOY****hDvSYA",
"rule_name": "OpenAI API Key",
"rule_id": "ghost.openai.1",
"entropy": 5.2,
"rule_entropy_threshold": 5.1,
"rule_entropy_threshold_met": true
}
]
}
Markdown -- report format with tables and findings sections, suitable for documentation or issue tracking.
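As an illustration of consuming the JSON format programmatically, the sketch below parses a report and lists the matches that met their entropy threshold. The field names come from the JSON example above; the sample is a trimmed copy of it, and the function name summarize is hypothetical.

```python
import json

# Trimmed copy of the JSON example above (only one result is listed).
SAMPLE_REPORT = """
{
  "summary": {"files_scanned": 1247, "matches_found": 3},
  "results": [
    {
      "file_path": "src/config/api.ts",
      "line_number": 15,
      "redacted": "sk-proj-0JdlOY****hDvSYA",
      "rule_name": "OpenAI API Key",
      "rule_id": "ghost.openai.1",
      "entropy": 5.2,
      "rule_entropy_threshold": 5.1,
      "rule_entropy_threshold_met": true
    }
  ]
}
"""


def summarize(report_text: str) -> list:
    """Return one line per result that met its rule's entropy threshold."""
    report = json.loads(report_text)
    lines = []
    for r in report["results"]:
        if r["rule_entropy_threshold_met"]:
            location = f'{r["file_path"]}:{r["line_number"]}'
            lines.append(f'{location} {r["rule_name"]} ({r["rule_id"]}) {r["redacted"]}')
    return lines


for line in summarize(SAMPLE_REPORT):
    print(line)
```

Because the report contains only redacted values by default, a summary like this can be written to CI logs.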
Exit codes
- 0 -- scan completed, no secrets found
- 1 -- scan completed, secrets were found (or output is JSON/Markdown)
Rule authoring
There are two main strategies for writing secret detection patterns with Poltergeist:
- Explicit secret format, when the format is known and static
- Variable declaration detection (targeting likely ways the secret might be declared as a variable)
In general, we prefer to write more rules that are more precise, more specific, and easier to reason about, rather than fewer rules that are more general. The performance penalty of more rules is negligible.
Explicit secret format
When the format is known and static, we can use a regex pattern to match the secret. Often these types of secrets have a known prefix, a fixed length, and sometimes a magic string (e.g. OpenAI API keys have a magic string T3BlbkFJ).
Variable declaration detection
When the format is not known, we look for likely ways the secret might appear in source code when declared as a variable. We look for variable names that are unique to the secret provider. For example, Azure Storage Account keys have a fixed length but no predictable format, so we match variations on Azure (case insensitive) followed by a high-entropy, fixed-length string. Avoid generic variable names like TOKEN, because they make it harder to map a match back to a specific secret provider.
Capture group
Currently we only expect one capture group from the regex pattern. If the secret is a known format, the capture group should be just the secret itself.
The Huggingface rule, for example:
(?x)
\b
(hf_(?i)[A-Z0-9]{34})
\b
This matches a token such as 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 exactly.
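To experiment with the Huggingface pattern outside Poltergeist, it can be translated to Python. This is a sketch, not the rule itself: Python's re module does not accept a mid-pattern (?i), so the translation uses the scoped form (?i:...), and the helper name extract_secret is hypothetical.

```python
import re

# Python translation of the Huggingface rule. The mid-pattern `(?i)` of the
# original becomes the scoped group `(?i:...)`.
HF_PATTERN = re.compile(
    r"""(?x)
    \b
    (hf_(?i:[A-Z0-9]{34}))
    \b
    """
)


def extract_secret(text: str):
    """Return the captured token (group 1), or None when nothing matches."""
    m = HF_PATTERN.search(text)
    return m.group(1) if m else None


print(extract_secret('HF_TOKEN = "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"'))
```

The single capture group contains only the token, so the surrounding variable name and quotes are not part of the reported match.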
However, if we are looking for a variable declaration secret, we capture the variable name in addition to the secret value.
The Clearbit rule, for example:
(?x)
\b
(
(?i)clearbit\w*(?:token|key|secret)\w*
[\W]{0,40}?
[A-Z0-9_-]{35}
)
\b
This matches the CLEARBIT_TOKEN variable as well as the secret value in the case of export CLEARBIT_TOKEN="td3aCzKhouIIgiua1d6Yvl5veaTNHMFbb7H".
Though this lowers the entropy of the match overall, it allows us to see (even in redacted logs) more information about the match. It is easier to understand how the match occurred and whether it might be a false positive.
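The variable-declaration capture can be tried in Python as well. In this sketch the whole pattern is compiled case-insensitively, which is intended to mirror the (?i) that opens the capture group in the original; the helper name find_declaration is hypothetical.

```python
import re

# Python translation of the Clearbit rule. Verbose and case-insensitive modes
# are passed as flags instead of inline `(?x)` and `(?i)`.
CLEARBIT_PATTERN = re.compile(
    r"""
    \b
    (
      clearbit\w*(?:token|key|secret)\w*
      [\W]{0,40}?
      [A-Z0-9_-]{35}
    )
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_declaration(text: str):
    """Return the variable name plus secret (group 1), or None."""
    m = CLEARBIT_PATTERN.search(text)
    return m.group(1) if m else None


line = 'export CLEARBIT_TOKEN="td3aCzKhouIIgiua1d6Yvl5veaTNHMFbb7H"'
print(find_declaration(line))
```

The captured text starts at CLEARBIT_TOKEN and ends with the last character of the secret, so a redacted finding still shows which variable triggered the rule.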
Non-word matcher
We enlarged the non-word matcher from 10 to 40 characters to allow for more whitespace between the variable name and the secret ([\W]{0,10}? -> [\W]{0,40}?).
Backwards compatibility
Do not change rule numbers. If a rule needs to be deprecated, delete it without changing the number of other rules.
YAML Format
Example Poltergeist rule file:
rules:
- name: Anthropic API Key
id: ghost.anthropic.1
description: Matches an Anthropic API key.
tags:
- api
- anthropic
pattern: |
(?x)
\b
(sk-ant-api\d{2}-(?i)[A-Z0-9_-]{86}-(?i)[A-Z0-9_]{6}AA)
\b
entropy: 5.1
redact: [16, 4]
tests:
assert:
- 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
assert_not:
- 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
history:
- 2025-08-02 initial version
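The tests section lends itself to a simple check: every assert sample should match the pattern and every assert_not sample should not. The sketch below shows one way such a check could work. It is not Poltergeist's own test runner: the rule is written as a Python dict rather than loaded from YAML, the pattern is translated to Python syntax with a scoped (?i:...) group, and run_rule_tests is a hypothetical name.

```python
import re

# The Anthropic rule from the example above, reduced to the fields the check needs.
RULE = {
    "id": "ghost.anthropic.1",
    "pattern": r"""(?x)
        \b
        (sk-ant-api\d{2}-(?i:[A-Z0-9_-]{86}-[A-Z0-9_]{6}AA))
        \b
    """,
    "tests": {
        "assert": [
            "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",
        ],
        "assert_not": [
            "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",
        ],
    },
}


def run_rule_tests(rule: dict) -> list:
    """Return a list of failure messages; an empty list means all tests pass."""
    pattern = re.compile(rule["pattern"])
    failures = []
    for sample in rule["tests"]["assert"]:
        if not pattern.search(sample):
            failures.append(f"{rule['id']}: expected a match for {sample}")
    for sample in rule["tests"]["assert_not"]:
        if pattern.search(sample):
            failures.append(f"{rule['id']}: unexpected match for {sample}")
    return failures


print(run_rule_tests(RULE) or "all rule tests passed")
```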
Rule Components
Required
- name: The name of the rule
- id: Globally unique identifier for the rule
- description: The description of the rule. This is user-facing content
- tags: The tags used to categorize the rule
- pattern: The regex pattern for matching
- entropy: The minimum entropy threshold for matches
- redact: The prefix and suffix of the match to preserve; the rest is redacted
- tests: The test cases for rule validation
- history: The change history of the rule (at least one entry)
Optional
- refs: URLs of external resources supporting the secret detection approach or explaining when/where/how the secret is typically used
- notes: Useful notes or references for future rule authors/editors
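A rule file can be sanity-checked against the required fields before it is used. The following is a minimal sketch that assumes a rule has already been parsed into a Python dict; the function name check_rule is hypothetical, and treating redact as exactly two integers is inferred from the [prefix, suffix] form shown in the example.

```python
REQUIRED_FIELDS = (
    "name", "id", "description", "tags", "pattern",
    "entropy", "redact", "tests", "history",
)


def check_rule(rule: dict) -> list:
    """Return a list of problems found in a parsed rule; empty means it looks complete."""
    problems = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in rule]
    if "history" in rule and not rule["history"]:
        problems.append("history needs at least one entry")
    redact = rule.get("redact")
    # Assumption: redact is a [prefix, suffix] pair of integers.
    if redact is not None and not (
        len(redact) == 2 and all(isinstance(n, int) for n in redact)
    ):
        problems.append("redact must be [prefix, suffix]")
    return problems
```

Optional fields such as refs and notes are ignored by this check.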
False Positive Mitigation
We employ some common techniques to reduce false positives in real-time during the scan.
Boundaries
Use word boundaries (\b) when possible to reduce false positives. A word boundary matches the position between a word character and a non-word character (or the start or end of the input), which prevents a pattern from matching in the middle of a longer word.
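The effect of a boundary is easy to see on a small example. The token format tok_ plus eight characters below is made up purely for illustration and does not correspond to any built-in rule.

```python
import re

# Hypothetical token format, with and without word boundaries.
UNBOUNDED = re.compile(r"tok_[a-z0-9]{8}")
BOUNDED = re.compile(r"\btok_[a-z0-9]{8}\b")

text = "mytok_abcd1234suffix tok_abcd1234"

# Without boundaries, the fragment inside "mytok_abcd1234suffix" also matches.
unbounded_hits = UNBOUNDED.findall(text)
# With boundaries, only the standalone token matches.
bounded_hits = BOUNDED.findall(text)

print(len(unbounded_hits), len(bounded_hits))
```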
Entropy
Use entropy to filter out false positives. True secrets, keys, and cryptographic material should have high entropy. Specifying the entropy field forces the rule to only match secrets with an entropy greater than or equal to the specified value.
The calculated Shannon entropy and the rule threshold are both included in the output, allowing you to see exactly why a match was flagged or filtered.
Stop Words
Stop words are words that are common in the English language and should not appear in most valid secrets.
⚠️ Not implemented: we aren't yet checking for stop words in the matching engine, but the accompanying skill DOES consider common stop words during analysis
Redaction
The redaction points give the lengths of the prefix and suffix of the match to preserve; the rest of the match is redacted.
For example, if the match is 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, and the redaction points are [16, 6], the redacted match will be:
sk-ant-api03-bvf*****pzoAAA
The first 16 and the last 6 characters are preserved. The rest of the match is redacted.
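The prefix/suffix behavior can be sketched as a small function. This is an illustration only: the function name redact is hypothetical, the fixed five-asterisk mask is taken from the example above, and masking the entire value when the match is too short to redact safely is an assumption.

```python
def redact(match: str, prefix: int, suffix: int, mask: str = "*****") -> str:
    """Keep the first `prefix` and last `suffix` characters and mask the rest."""
    # Assumption: if nothing would be hidden, mask the whole value instead.
    if prefix + suffix >= len(match):
        return mask
    tail = match[len(match) - suffix:] if suffix else ""
    return match[:prefix] + mask + tail


key = (
    "sk-ant-api03-bvf-Yc7XinwDY3SG-daIsspe65PpPtGIXL0DmSHrOn0Z_"
    "ufYzUTbbfsnp8yo3FUG_gx_BGkpyRt5t2tSt7CHQA-S0pzoAAA"
)
print(redact(key, 16, 6))
```

A fixed-length mask also avoids revealing the length of the redacted portion.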
Performance Tips
- Use specific patterns: More specific regex patterns are faster than broad ones
- Boundaries: Use \b boundaries in regex patterns when possible to reduce false positives
Skill integration
When used through the ghost-scan-secrets skill, Poltergeist's JSON output feeds into AI context assessment. The skill parses matches into candidates, assesses each one (real vs. placeholder, hardcoded vs. environment variable, production vs. test), and writes confirmed findings with severity assessments and remediation guidance. See Secret scanning for details.