FyPDF — PDF & file conversion suite · Advanced

Read the scan, repair the file, archive the result.

Most online tools wave at conservation work — a free OCR that swallows accents, an 'archive' button that emits 'PDF/A-ish' files, and no repair option for the corrupted download you actually need to read. The conservation lab does the careful version. Four operations, every one with a written diagnostic, every output veraPDF-validated where the spec asks for it.

  • Source kept untouched
    Every operation produces a separate output — your original file is never overwritten. Compare side-by-side; throw the result out and re-run with different settings if it isn't what you wanted.
  • Diagnoses are transparent
    Each operation logs what it found and what it did. OCR records confidence per page, repair lists the structural problems it fixed, archive validates against the chosen PDF/A level, flatten counts the layers collapsed.
  • Standards are the standard
    PDF/A-1, A-2, and A-3 conformance are real conformance — verifiable with veraPDF. The output isn't 'PDF/A-ish'; it passes spec validation.
  • Reversible where possible
    OCR and archive are additive — the operation can be re-run with different settings. Repair recovers what it can. Flatten is irreversible by definition; the source preserves the layered original.
Intake · Specimen 0142 · On bench
Catalog C/2026/0142 · Specimen ×4
Scanned · 600 dpi · 12 pages · 4.2 MB
Diagnostic readout
  01 Read · scan → searchable layer · 98% confidence · 12 pages
  02 Restore · xref rebuilt · 3 objects recovered · 1 lost
  03 Archive · PDF/A-2u verified · veraPDF: pass
  04 Settle · 12 layers → 1 layer · 8 forms baked
Status 04:12 · all readouts complete
Conservation lab · station 07 · 4 diagnostics · encrypted session · veraPDF-validated
The Diagnostics Ledger

Four operations, every one with a written finding.

Each row names the operation, its mode, the editorial promise, and a sample diagnostic readout — the kind of finding the engine records on every actual run.

Intake Register · Issue 07 · Advanced Track · 4 stations on bench

Stn. 01 · OCR PDF · Read · Encrypted
Promise: Recover real, selectable text from scanned PDFs in 100+ languages, with per-page confidence scoring and an optional manual override per region. The output is a hybrid PDF: the original page image stays visible, and the recognised text rides invisibly underneath for search.
Sample readout: scan → searchable layer · 98% confidence · 12 pages

Stn. 02 · Repair PDF · Restore · Encrypted
Promise: Recover content from corrupted or partially uploaded PDFs (malformed cross-reference table, broken trailer, missing object streams). Fixes what's recoverable, reports what isn't, never silently degrades the output.
Sample readout: xref rebuilt · 3 objects recovered · 1 lost

Stn. 03 · PDF to PDF/A · Archive · Encrypted
Promise: Convert to PDF/A-1, A-2, or A-3 conformance for ISO 19005 long-term storage. Embeds fonts, attaches ICC colour profiles, removes external dependencies, and validates the result with veraPDF before emitting.
Sample readout: PDF/A-2u verified · veraPDF: pass

Stn. 04 · Flatten PDF · Settle · Cloud-enhanced
Promise: Collapse layers, form fields, and annotations into the base content stream, producing a single immutable document for downstream pipelines that can't process layered PDFs. The source PDF is preserved alongside.
Sample readout: 12 layers → 1 layer · 8 forms baked
The Fidelity Manifesto

What the conservator never quietly damages.

Four promises tagged onto every operation. The cheap path is to OCR with no confidence reporting and emit 'PDF/A-ish' files — these are the four guarantees you give up when you take it.

A · Promise 01 · Tag C/2026/0001

Source kept untouched

The conservator's first rule. FyPDF never overwrites your original — every operation produces a separate output you can compare side-by-side with the source, accept, throw out, or re-run with different settings. The original file stays exactly where you put it.

Spec: Output is a separate file · Source preserved unmodified · Versioned naming for each pass.
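One way to picture the "separate output, versioned naming" promise is the sketch below. The naming pattern (`name.operation.vN.pdf`) and both function names are hypothetical, chosen only for illustration; the page does not document FyPDF's actual scheme. The point it shows is that opening the target in exclusive-create mode makes an accidental overwrite fail loudly instead of succeeding quietly.

```python
from pathlib import Path


def next_output_path(source: Path, operation: str) -> Path:
    """Return the first unused name like dossier.ocr.v1.pdf next to the source."""
    version = 1
    while True:
        candidate = source.with_name(
            f"{source.stem}.{operation}.v{version}{source.suffix}"
        )
        if not candidate.exists():
            return candidate
        version += 1


def write_output(source: Path, operation: str, data: bytes) -> Path:
    """Write a new versioned output; the source file is never opened for writing."""
    target = next_output_path(source, operation)
    # Mode "xb" raises FileExistsError rather than overwriting an existing file.
    with open(target, "xb") as fh:
        fh.write(data)
    return target
```

Re-running the same operation with different settings then yields `v2`, `v3`, and so on, which keeps every pass available for side-by-side comparison.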
B · Promise 02 · Tag C/2026/0002

Every diagnosis is transparent

Each operation produces a written diagnostic alongside the file: OCR records per-page confidence and the regions it couldn't resolve; Repair lists every structural fault and what was reconstructed vs lost; Archive validates against the chosen PDF/A level and records any non-conformance; Flatten counts the layers, fields, and annotations it baked.

Spec: Per-operation receipt · Confidence scoring · Damage report · Conformance log.
C · Promise 03 · Tag C/2026/0003

Standards are the standard

PDF/A conformance is real conformance — A-1 (long-term basics), A-2 (with layers / transparency), A-3 (with embedded source files). The output is validated with veraPDF before it's released. If it doesn't pass, the engine tells you what failed and offers a downgrade or repair pass instead of releasing a 'PDF/A-ish' result.

Spec: ISO 19005 conformance · veraPDF post-validation · Failure surfaced before release.
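The "failure surfaced before release" rule amounts to a gate between validation and output. The sketch below is a minimal illustration of that gate only: `ValidationResult` and `release_decision` are hypothetical names, and the result object is assumed to be filled in from whatever report the validator produces (no veraPDF call is made here).

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    """Hypothetical summary of one validation run."""
    level: str                      # e.g. "2u"
    passed: bool
    failed_rules: list[str] = field(default_factory=list)


def release_decision(result: ValidationResult) -> dict:
    """Release only on a pass; otherwise report failures and the offered next steps."""
    if result.passed:
        return {"release": True, "level": result.level, "failures": []}
    return {
        "release": False,
        "level": result.level,
        "failures": list(result.failed_rules),
        # The two alternatives the page says are offered instead of a release.
        "next_steps": ["downgrade level", "repair pass"],
    }
```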
D · Promise 04 · Tag C/2026/0004

Reversible where possible

OCR adds a text layer and leaves the visible page unchanged; the pre-OCR PDF remains available as the untouched source file. Archive operations can be re-run with different settings. Repair recovers what it can. Flatten is irreversible by definition, so the conservator keeps the source and the layered original is never lost to a mis-applied flatten.

Spec: OCR / archive: additive, source-recoverable · Repair: best-effort, logged · Flatten: terminal, source preserved.
The Tool Spreads

Read each operation like a lab note.

What each operation preserves, transforms, and where it pairs back into the rest of the suite. Four spreads, in order.

Spread 01 · Scanned PDF → Searchable PDF

OCR PDF

Add a real, selectable text layer to a scanned PDF without touching the visible page image. The recogniser handles 100+ languages with per-page detection and a manual override panel for hand-written or specialty regions. Output is a hybrid PDF — the visible scan stays exactly as you uploaded it, the recognised text rides invisibly beneath so search engines and reader search bars can find words.

Preserves
  • Original scan image at upload resolution
  • Page count and order
  • Bookmarks and annotations from the source
  • Embedded metadata (XMP, where present)
Transforms
  • Per-page language detection · 100+ scripts
  • Confidence scoring on every recognised region
  • Manual override panel for low-confidence areas
  • Hybrid PDF output (image visible, text searchable)
Formats: .pdf in (scan) · .pdf out (hybrid) · 100+ languages
Confidence map · page 04 / 12 · avg 94% · scan layer (pixels) + text layer (invisible)
Sample glyph confidences: A 99, n 98, n 97, u 96, a 95, l 91, · 100, R 76, e 88, p 94
Legend: ≥95% · 85–94% · <85% (review)
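The three legend bands can be expressed as a small classifier. This is a sketch under the assumption that confidences are percentages and that the thresholds are exactly the ones in the legend (95 and 85); the function names are illustrative, not part of any documented FyPDF interface.

```python
def band(confidence: float) -> str:
    """Map a confidence percentage to one of the three legend bands."""
    if confidence >= 95:
        return ">=95%"
    if confidence >= 85:
        return "85-94%"
    return "<85% (review)"


def pages_needing_review(page_confidences: dict[int, float]) -> list[int]:
    """Return the page numbers whose confidence falls in the review band."""
    return sorted(p for p, c in page_confidences.items() if c < 85)
```

Applied to the sample glyphs above, `R` at 76 is the only one that lands in the review band.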
Spread 02 · Corrupted PDF → Recovered PDF

Repair PDF

Recover content from corrupted or partially-uploaded PDFs. The repair engine walks the file looking for valid object streams, rebuilds the cross-reference table from what it finds, reconstructs a valid trailer, and emits the recovered document with a written report listing what was reconstructed and what was beyond recovery.

Preserves
  • Recoverable page content and structure
  • Salvageable bookmarks and annotations
  • Embedded fonts where the font tables are intact
  • Source file always retained alongside
Transforms
  • Cross-reference table rebuilt from valid objects
  • Trailer reconstructed
  • Damaged object streams recovered or flagged as lost
  • Damage report emitted alongside the output
Formats: .pdf in (corrupt) · .pdf out (recovered) · damage report
xref table · before / after · obj 1 / 6
OBJ   OFFSET       STATE
0001  00000000018  intact
0002  00000000142  intact
0003  0000??????   rebuilt
0004  00000000884  intact
0005  0000??????   lost
0006  00000001220  intact
Summary: 4 intact · 1 rebuilt · 1 lost
xref reconstructed · damage logged
Spread 03 · PDF → PDF/A-1 / 2 / 3

PDF to PDF/A

Convert any PDF into ISO 19005 conformance for long-term archive. Choose your level — A-1 (basic, no layers), A-2 (with layers and transparency), or A-3 (with embedded source files for round-trippable archives). The output is validated with veraPDF before release; if conformance fails, the conservator surfaces the reason instead of releasing a 'PDF/A-ish' file.

Preserves
  • Document content, structure, and page order
  • Existing fonts (re-embedded as full subsets)
  • Tagged structure tree (PDF/UA accessibility)
  • Cross-references, hyperlinks, outlines
Transforms
  • Fonts → fully embedded subsets
  • Colour spaces → ICC profiles attached
  • External dependencies → removed or embedded
  • veraPDF post-validation against chosen level
Formats: .pdf in · .pdf/a out · A-1 / A-2 / A-3
Conformance · level A-2u · verified
  • Fonts embedded as subsets
  • ICC colour profile attached
  • External hyperlinks stripped
  • Encryption removed
  • JavaScript stripped
  • Tag tree validated
  • veraPDF pass · A-2u
veraPDF · PASS · 2026-05-03
ISO 19005 · veraPDF validated
Spread 04 · Layered PDF → Single content stream

Flatten PDF

Collapse layers, form fields, and annotations into the base content stream — useful when the downstream pipeline can't process layered PDFs (legal evidence handoffs, print-ready masters, ingestion into systems that ignore layers). The flatten is terminal by design; the source PDF is always preserved alongside so the layered original isn't lost.

Preserves
  • Source PDF preserved unmodified
  • Visible appearance pixel-identical post-flatten
  • Hyperlinks and outline destinations
  • Embedded fonts and resources
Transforms
  • Optional content groups (layers) → single layer
  • Form fields → static page content
  • Annotations and comments → baked into the page
  • Receipt records what was flattened
Formats: .pdf in (layered) · .pdf out (flat) · receipt log
Layer stack · before · 6 layers
  • Comments ×12
  • Form fields ×8
  • Annotations ×23
  • OCG layer 2 ×1
  • OCG layer 1 ×1
  • Base content ×1
After · 1 layer · single content stream
N layers → 1 content stream
The OCR Confidence Atlas

Honest accuracy bands, by script.

Every OCR engine claims "100+ languages". Most don't tell you what accuracy you actually get. The atlas does — script family by script family, the confidence band on clean printed input plus a conservator's note on the hard cases.

Script Atlas · Recogniser v4.2 · 8 script families · clean printed input

01 · Latin · confidence band 96–99%
Examples: English, Spanish, French, German, Portuguese, Italian, +3
Caveat: Cleanest input on the planet; printed Latin scripts in modern fonts top out around 99% confidence.

02 · Cyrillic · confidence band 94–98%
Examples: Russian, Ukrainian, Bulgarian, Serbian, Macedonian
Caveat: Pre-Soviet typography drops a few points; modern print is essentially Latin-grade.

03 · Greek · confidence band 93–97%
Examples: Modern Greek, Polytonic Greek (with diacritics)
Caveat: Polytonic diacritics on classical texts shave a few percent; the manual override panel handles edge glyphs.

04 · CJK · confidence band 90–96%
Examples: Simplified Chinese, Traditional Chinese, Japanese, Korean
Caveat: Vertical layouts and kerning-rich typesetting need a per-page model; mixed Han + kana scores towards the high end.

05 · Arabic & Hebrew · confidence band 88–95%
Examples: Arabic, Persian, Urdu, Hebrew, Yiddish
Caveat: Right-to-left layout and connected glyphs are handled at the script level; line direction never gets crossed in the output.

06 · Devanagari & related · confidence band 86–94%
Examples: Hindi, Marathi, Sanskrit, Nepali, Bengali, Tamil
Caveat: Conjunct ligatures pull the band a few points lower than Latin; modern Unicode fonts score at the top.

07 · Handwriting (any script) · confidence band 60–85%
Examples: Print hand, Cursive, Field notebooks
Caveat: Wide variance; neat print lands in the mid 80s, freeform cursive in the 60s. Manual review recommended.

08 · Old / pre-1900 typography · confidence band 70–88%
Examples: Fraktur, Old English print, Long-s manuscripts
Caveat: Recogniser ships pre-1900 type models for major languages; results improve sharply with high-DPI source scans.

All bands measured on 600 dpi clean print · field scans variable
PDF/A Conformance Matrix

Three levels, every feature accounted for.

Regulators don't ask whether a file is "PDF/A"; they ask which level. The matrix names every feature the spec touches and where each level lands. A tick means required or supported at this level, a dash means not in this level, and a black square means forbidden by the spec.

PDF/A-1
ISO 19005-1 (2005)

The strictest archival level. No layers, no transparency, no embedded files. Built for long-term predictability.

Variants
  • A-1b (basic visual)
  • A-1a (with accessibility tags)
PDF/A-2
ISO 19005-2 (2011)

Adds layers, transparency, JPEG2000 compression, and digital signatures. The pragmatic mid-tier.

Variants
  • A-2b (basic)
  • A-2u (Unicode mapping)
  • A-2a (accessibility tags)
PDF/A-3
ISO 19005-3 (2012)

Same feature set as A-2 plus embedded source files — the round-trippable archive that keeps the .docx alongside the PDF/A.

Variants
  • A-3b
  • A-3u
  • A-3a
Feature matrix · columns: PDF/A-1 (ISO 19005-1, 2005) · PDF/A-2 (ISO 19005-2, 2011) · PDF/A-3 (ISO 19005-3, 2012) · Notes
  • Embedded fonts (full subsets): Required at every level; no external font dependencies.
  • ICC colour profile attached: Mandatory if any colour space is used.
  • External hyperlinks (web URLs): Stripped on archive; links don't survive 30 years of internet rot.
  • Optional content groups (layers): PDF/A-1 forbids them; A-2 and A-3 allow.
  • Transparency and blending modes: PDF/A-1 forbids; A-2 introduces controlled transparency.
  • JPEG2000 image compression: A-1 is built on PDF 1.4, which predates JPEG2000 support; A-2 adds it.
  • Digital signatures (PAdES): A-1 doesn't profile signatures; A-2 adds embedded signature support.
  • Embedded source files (.docx, .xlsx): A-3 is the only level that allows embedding the original; useful for round-trippable archives.
  • Multimedia (video / audio): Forbidden at every PDF/A level; archives are static documents.
  • JavaScript / interactive forms: Stripped on archive; interactive content can't be guaranteed to render in 2055.
  • Encryption / passwords: Forbidden; archives must remain readable indefinitely.
  • Accessibility tag tree (PDF/UA): Required at the 'a' sub-level (1a, 2a, 3a) for accessibility-grade archives.
Legend: Required / supported · Not in this level · Forbidden by spec
veraPDF · post-validated on every output
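Read as a decision rule, the matrix notes imply a lowest PDF/A part for a given document: layers, transparency, or JPEG2000 push it to part 2, and embedded source files push it to part 3. The helper below is a sketch of that reading only. The function and parameter names are hypothetical, and it says nothing about the b / u / a conformance sub-level, which depends on Unicode mapping and tagging.

```python
def minimum_pdfa_part(
    uses_layers: bool = False,
    uses_transparency: bool = False,
    uses_jpeg2000: bool = False,
    embeds_source_files: bool = False,
) -> int:
    """Return the lowest PDF/A part (1, 2, or 3) that permits the listed features."""
    if embeds_source_files:
        return 3  # only A-3 allows embedding the original source file
    if uses_layers or uses_transparency or uses_jpeg2000:
        return 2  # forbidden or unavailable in A-1, allowed from A-2
    return 1
```

A scanned dossier with no layers fits part 1, while an archive that keeps its .docx alongside needs part 3.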
The Damage Catalogue

Six ways a PDF breaks and what we do about it.

Most "PDF repair" tools are a marketing word for "guessing". The conservator names the damage by type, explains the cause, and is honest about recovery odds before the operation runs.

Type · 01 · Malformed cross-reference table · Recovery likelihood: high
Symptom: Adobe refuses to open · Preview shows blank pages · 'damaged file' dialog
Cause: The xref table that maps object IDs to byte offsets got corrupted, usually by a partial download or a mid-write storage failure.
Recovery: Walks the file looking for valid 'obj … endobj' boundaries, rebuilds the cross-reference table from intact objects, and writes a clean trailer.
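The object walk described under Recovery can be sketched with a byte-level regular expression. This is an illustration of the general technique, not FyPDF's engine: `scan_objects` is a hypothetical name, and a production scanner would also need to avoid false matches inside binary stream data.

```python
import re

# "<number> <generation> obj" at the start of an indirect object.
OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")


def scan_objects(data: bytes) -> dict[int, tuple[int, int]]:
    """Map object number -> (byte offset, generation) for each closed object.

    An object counts as intact only if "endobj" appears before the next
    object header. Later definitions of the same number replace earlier ones.
    """
    headers = list(OBJ_HEADER.finditer(data))
    found: dict[int, tuple[int, int]] = {}
    for i, match in enumerate(headers):
        limit = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        if data.find(b"endobj", match.end(), limit) == -1:
            continue  # never closed: leave it for the damage report
        found[int(match.group(1))] = (match.start(), int(match.group(2)))
    return found
```

The resulting offsets are what a rebuilt cross-reference table would point at; objects missing from the map are the candidates for the "lost" column.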
Type · 02 · Missing or broken trailer · Recovery likelihood: high
Symptom: Reader finds the file structure but can't locate the catalog · 'no document root'
Cause: The trailer at the end of the file (the index of indices) is missing or malformed, common after a truncated transfer.
Recovery: Reconstructs a minimal valid trailer pointing at the recovered xref. Defaults to the last valid catalog object found in the body.
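For orientation, a classic cross-reference section and minimal trailer look like the output of the sketch below. It assumes every recovered object has generation 0, writes lost object numbers as free entries, and takes the catalog's object number and the byte offset where the xref section will be written as inputs; the function name is hypothetical.

```python
def build_xref_and_trailer(
    offsets: dict[int, int], root_obj: int, xref_offset: int
) -> bytes:
    """Serialise a classic xref table and a minimal trailer.

    offsets maps object number -> byte offset of its "N 0 obj" header.
    Each entry is exactly 20 bytes: 10-digit field, 5-digit generation,
    a type letter, then space + newline.
    """
    size = max(offsets, default=0) + 1
    free = [n for n in range(1, size) if n not in offsets]
    # Free entries form a linked list that starts at object 0 and ends back at 0.
    next_free = dict(zip([0] + free, free + [0]))
    out = [b"xref\n", b"0 %d\n" % size]
    for num in range(size):
        if num in offsets:
            out.append(b"%010d %05d n \n" % (offsets[num], 0))
        else:
            # Generation 65535 marks the number as never to be reused.
            out.append(b"%010d 65535 f \n" % next_free[num])
    out.append(b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root_obj))
    out.append(b"startxref\n%d\n%%%%EOF\n" % xref_offset)
    return b"".join(out)
```

A real repair would carry over further trailer keys (such as /Info or /ID) when they can be recovered; this sketch keeps only what a reader needs to find the document root.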
Type · 03 · Damaged object stream · Recovery likelihood: medium
Symptom: Specific pages or images render blank · 'cannot decompress object 24 0' errors
Cause: A compressed object stream got partially corrupted; typically Flate decompression fails midway through.
Recovery: Recovers as much of the stream as decompresses cleanly; flags the unrecoverable portion in the damage report and emits a placeholder for the affected page or resource.
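"As much as decompresses cleanly" can be illustrated with Python's standard `zlib` module: feed the Flate data in small chunks and keep whatever was produced before the decoder gives up. This is a sketch of the idea with a hypothetical function name. One caveat worth stating: bytes emitted before a mid-stream error have not passed the stream's final checksum, so they should be treated as unverified.

```python
import zlib


def salvage_flate(raw: bytes, chunk_size: int = 64) -> tuple[bytes, bool]:
    """Inflate as much of a Flate stream as possible.

    Returns (recovered bytes, complete), where complete is True only if
    the decoder reached the end of the stream without an error.
    """
    inflater = zlib.decompressobj()
    recovered = bytearray()
    for start in range(0, len(raw), chunk_size):
        try:
            recovered += inflater.decompress(raw[start:start + chunk_size])
        except zlib.error:
            return bytes(recovered), False  # keep what came out before the fault
        if inflater.eof:
            break
    return bytes(recovered), inflater.eof
```

A truncated stream returns a prefix of the original content with `complete` set to False, which is the signal to flag the remainder in the damage report.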
Type · 04 · Partial / truncated download · Recovery likelihood: medium
Symptom: File opens but ends mid-document · last few pages missing or unrenderable
Cause: Network transfer cut off before the final EOF marker. The file has no trailer, the xref is incomplete, and the last object is partially written.
Recovery: Rebuilds the xref from intact objects up to the corruption boundary. Recovered output has the pages that survived; lost pages are listed in the report.
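The three signs named under Cause (no end-of-file marker, no pointer to the xref, a half-written last object) can be checked cheaply before any repair runs. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the 1024-byte tail window mirrors how many readers search for the marker rather than anything stated on this page.

```python
def truncation_signs(data: bytes, window: int = 1024) -> dict[str, bool]:
    """Report the usual symptoms of a transfer that stopped early."""
    tail = data[-window:]
    last_obj = data.rfind(b" obj")       # start of the last object header
    last_endobj = data.rfind(b"endobj")  # close of the last finished object
    return {
        "missing_eof": b"%%EOF" not in tail,
        "missing_startxref": b"startxref" not in tail,
        "open_last_object": last_obj > last_endobj,
    }
```

If all three flags come back True, the file matches the truncated-download pattern and the recovery path above applies.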
Type · 05 · Corrupted font subset table · Recovery likelihood: medium
Symptom: Pages render but text shows as rectangles or wrong glyphs · 'Cannot extract font' warning
Cause: Embedded font subset table got truncated or partially overwritten. Glyph metrics survive but the encoding map is broken.
Recovery: Attempts to reconstruct the encoding from the ToUnicode CMap if intact. If not, falls back to substitute fonts at matching metrics and flags the loss.
Type · 06 · Encryption header damage · Recovery likelihood: low
Symptom: Reader prompts for password but won't accept any · 'invalid security handler'
Cause: The encryption dictionary is corrupted; the cipher and key length are present but the permission flags or owner-key hash are malformed.
Recovery: If the user supplies the correct owner password, reconstructs the encryption dictionary and emits an unlocked output. Without the password, the operation refuses.
Recovery likelihood legend: high (high recovery) · medium (partial recovery) · low (low recovery)
Receipt Anatomy

Every operation leaves a written record.

Conservation work without a receipt is just opinion. Every Advanced operation emits a signed diagnostic receipt that names what was done and proves the output corresponds to the input.

FyPDF · Diagnostic Receipt · #4F-21-D8
Issued · 2026-05-03 · Conservator session 6f
1 · OPERATION · OCR_PDF · v4.2.1 · 2026-05-03T09:14:22Z
2 · SOURCE · sha256: 3F4A…7C12 · 4,218,624 bytes
3 · OUTPUT · sha256: 7B91…F034 · 4,612,940 bytes
4 · FINDINGS · 12 pages · 11 above 95% confidence · 1 page (p. 8) at 76%, manual review suggested
5 · ACTIONS · text-layer added (12 pages) · language detect (eng/auto) · ToUnicode map embedded
6 · CONFORMANCE · n/a (OCR is additive); original veraPDF state preserved
7 · SIGNATURE · ed25519: 9A4F…D8C2 · session: 6f-2026-05-03
END OF RECEIPT · 7 blocks · ed25519
1
Operation header

Names the conservation operation, the engine version, and the timestamp. The receipt is uniquely indexed for audit traceback.

2
Source fingerprint

SHA-256 of the input file at the moment the operation started — proof that the receipt corresponds to the document the conservator received.

3
Output fingerprint

SHA-256 of the emitted file — pin the result to its receipt for downstream verification.
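Anyone holding a file and its receipt can recompute the fingerprint independently. The sketch below uses Python's standard `hashlib` and returns the digest in uppercase hex alongside the byte count, matching the shape of the SOURCE and OUTPUT lines; the function name and chunk size are illustrative choices.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Return (uppercase SHA-256 hex digest, size in bytes) for a file."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        # Read in chunks so large PDFs are never loaded into memory at once.
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest().upper(), size
```

If the recomputed digest and byte count match the receipt, the file in hand is the one the receipt describes.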

4
Findings

What the conservator found in the source. OCR records per-page confidence; Repair lists every fault detected; Archive enumerates non-conformant attributes; Flatten counts the layers and forms encountered.

5
Actions

What the conservator did. Each action is logged separately so the audit trail is itemised — useful when a downstream reader asks 'why does the output differ from the source?'

6
Conformance

Where the spec applies, the receipt records the validation result. PDF/A operations carry a veraPDF result; OCR operations carry a per-language confidence band.

7
Signature

Receipt is signed with the engine's session key. Any tampering with the receipt body invalidates the signature and surfaces in the audit pipeline.
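The tamper-evidence property can be demonstrated with the standard library alone. Note the substitution: the receipt names ed25519, an asymmetric signature scheme that Python's standard library does not provide, so this sketch uses HMAC-SHA256 as a stand-in purely to show that any change to the receipt body breaks verification. The function names and the key are hypothetical; a real verifier would check the Ed25519 signature against the engine's public key.

```python
import hashlib
import hmac


def sign_receipt(body: bytes, session_key: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 over the receipt body (not Ed25519)."""
    return hmac.new(session_key, body, hashlib.sha256).hexdigest()


def verify_receipt(body: bytes, signature: str, session_key: bytes) -> bool:
    """Return True only if the body is byte-for-byte what was signed."""
    expected = sign_receipt(body, session_key)
    # compare_digest avoids leaking where the two values first differ.
    return hmac.compare_digest(expected, signature)
```

Changing a single character of the body, such as a confidence figure, makes `verify_receipt` return False.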

Before · After

What the bench actually returns.

Three real specimens that come through the conservation lab. Condition on intake on the left, condition on release on the right, treatment record at the top.

Treatment record · Scanned dossier → searchable archive · Operation: OCR PDF

Intake condition (received): Field-scanned dossier

180 pages of scanned correspondence, photographs of pages, mixed languages, no text layer. Search returns nothing for any phrase.

Release condition (treated): Hybrid OCR'd PDF

Same visible scan, with a recognised text layer underneath. 96% confidence on Latin, 91% on the Cyrillic appendix. Search finds names, dates, and clause references in milliseconds.

Takeaway: Becomes a real archive; researchers can search instead of scrolling, and citations link to actual phrases.

Treatment record · Corrupted file → recovered master · Operation: Repair PDF

Intake condition (received): Half-uploaded contract

Partial transfer left the file with a malformed cross-reference table and a missing trailer. Adobe refuses to open it; Preview shows blank pages.

Release condition (treated): Recovered PDF + damage report

xref table rebuilt from intact objects; 142 of 146 objects recovered; trailer reconstructed; 4 objects flagged as lost beyond recovery in the report.

Takeaway: Recovers the document the storage failure tried to take, with an honest written report of what couldn't come back.

Treatment record · Live PDF → audit-grade archive · Operation: PDF to PDF/A

Intake condition (received): Working PDF

Hyperlinks to external pages, web fonts referenced (not embedded), CMYK colour without an ICC profile. Fine to read today; will degrade over decades.

Release condition (treated): PDF/A-2u archival master

Fonts fully embedded as subsets, ICC profile attached, external references either embedded or stripped, veraPDF validation passed against PDF/A-2u.

Takeaway: Stable for thirty years, with the same content, the same appearance, and no external dependencies that can rot.

Who works the conservation lab

Five regulars at the conservator's bench.

The personas who reach for conservation work weekly — and the specific operations they run.

Persona · 01 · The archivist

Forty-year masters, audit-grade conformance

Public-record holdings, regulatory filings, library archives — anything that needs to outlive the software it was authored in. PDF/A conformance is the line; veraPDF validation is the receipt.

Reaches for
  • PDF to PDF/A · PDF/A-2u · veraPDF validated
  • OCR PDF · Pre-archive · scans get a text layer
Persona · 02 · The records officer

FOIA / RTI hand-offs that pass scrutiny

Disclosures need to arrive searchable, conformant, and beyond reasonable doubt about provenance. OCR'd, archived, flattened — every step on a written receipt.

Reaches for
  • OCR PDF · Scans → searchable disclosure copy
  • Flatten PDF · Annotations baked · pre-handoff master
Persona · 03 · The historian

Manuscripts that respect the original page

Photographed pages of nineteenth-century correspondence get a recognised text layer for research search — without the visible page being touched. The scan stays the scholarship; the OCR enables the index.

Reaches for
  • OCR PDF · Hybrid PDF · scan visible · text searchable
  • Repair PDF · Recover legacy archive masters
Persona · 04 · The compliance lead

Regulator wants PDF/A, not 'a PDF'

Annual filings, regulatory exhibits, audit deliverables — the spec says PDF/A and the spec means it. veraPDF validation is the only proof that's not vibes.

Reaches for
  • PDF to PDF/A · Per-filing · level mandated by regulator
  • Flatten PDF · Pre-archive · forms baked · layers settled
Persona · 05 · The data-pipeline engineer

Predictable inputs, no hidden layers

Downstream ingestion can't handle layered PDFs, won't render forms, and chokes on encryption. Flatten before ship; OCR for any scan that arrives in the queue. Receipts feed the audit trail.

Reaches for
  • Flatten PDF · Layers + forms + annotations → 1 layer
  • OCR PDF · Scans → searchable for the indexing pass
Common Questions

Before you send anything to the bench, a few honest answers.

Question Index
Q01 · 01 / 07

What does OCR do to my scanned PDFs?

FyPDF's OCR adds a recognised text layer to a scanned PDF without modifying the visible page image. The result is a "hybrid" PDF: when a reader opens it they see the original scan exactly as you uploaded it, but search bars and indexing engines can find words because the recognised text rides invisibly beneath. The recogniser handles 100+ languages with per-page detection and a manual override panel for low-confidence regions.
Conservation Reference · 01
7 questions in the conservation FAQ · Issue 07 · Advanced
Send to the lab

The bench is set. Tell it which station to run.

Drop the file, pick a station, take the result with its receipt. OCR for scans. Repair for damage. PDF/A for archive. Flatten for downstream pipelines.

Lab dispatch · 4 stations
Issue 07
All stations · written diagnostic · receipt issued
One Suite · Seven Tracks · Twenty-eight Tools and Counting