FyPDF — PDF & file conversion suite · Advanced

Read the scan, repair the file, archive the result.

Most online tools wave at conservation work — a free OCR that swallows accents, an 'archive' button that emits 'PDF/A-ish' files, and no repair option for the corrupted download you actually need to read. The conservation lab does the careful version. Four operations, every one with a written diagnostic, every output veraPDF-validated where the spec asks for it.

Open the conservation lab · Read the 4 diagnostics

Source kept untouched
Every operation produces a separate output — your original file is never overwritten. Compare side-by-side; throw the result out and re-run with different settings if it isn't what you wanted.
Diagnoses are transparent
Each operation logs what it found and what it did. OCR records confidence per page, repair lists the structural problems it fixed, archive validates against the chosen PDF/A level, flatten counts the layers collapsed.
Standards are the standard
PDF/A-1, A-2, and A-3 conformance are real conformance — verifiable with veraPDF. The output isn't 'PDF/A-ish'; it passes spec validation.
Reversible where possible
OCR and archive are additive — the operation can be re-run with different settings. Repair recovers what it can. Flatten is irreversible by definition; the source preserves the layered original.

Intake · Specimen 0142 · On bench

Catalog: C/2026/0142

Scanned · 600 dpi · 12 pages · 4.2 MB

Diagnostic readout

01 Read: scan → searchable layer · 98% confidence · 12 pages

02 Restore: xref rebuilt · 3 objects recovered · 1 lost

03 Archive: PDF/A-2u verified · veraPDF: pass

04 Settle: 12 layers → 1 layer · 8 forms baked

Status: 04:12 · all readouts complete

Conservation lab · station 07 · 4 diagnostics ready

Encrypted session · veraPDF-validated

The Diagnostics Ledger

Four operations, every one with a written finding.

Each row names the operation, its mode, the editorial promise, and a sample diagnostic readout — the kind of finding the engine records on every actual run.

Intake Register · Issue 07 · Advanced Track · 4 stations on bench

No. · Operation · Mode · Promise · Diagnostic readout

Stn. 01 · OCR PDF · Read · Encrypted

Recover real, selectable text from scanned PDFs, with 100+ languages, per-page confidence scoring, and optional manual override per region. The output is a hybrid PDF: the original page image stays visible, and the recognised text rides invisibly underneath for search.

Sample readout: scan → searchable layer · 98% confidence · 12 pages

Stn. 02 · Repair PDF · Restore · Encrypted

Recover content from corrupted or partially-uploaded PDFs: malformed cross-reference table, broken trailer, missing object streams. Fixes what's recoverable, reports what isn't, never silently degrades the output.

Sample readout: xref rebuilt · 3 objects recovered · 1 lost

Stn. 03 · PDF to PDF/A · Archive · Encrypted

Convert to PDF/A-1, A-2, or A-3 conformance for ISO 19005 long-term storage. Embeds fonts, attaches ICC colour profiles, removes external dependencies, and validates the result with veraPDF before emitting.

Sample readout: PDF/A-2u verified · veraPDF: pass

Stn. 04 · Flatten PDF · Settle · Cloud-enhanced

Collapse layers, form fields, and annotations into the base content stream, producing a single immutable document for downstream pipelines that can't process layered PDFs. The source PDF is preserved alongside.

Sample readout: 12 layers → 1 layer · 8 forms baked

The Fidelity Manifesto

What the conservator never quietly damages.

Four promises tagged onto every operation. The cheap path is to OCR with no confidence reporting and emit 'PDF/A-ish' files — these are the four guarantees you give up when you take it.

Tag · Promise 01 · C/2026/0001

Source kept untouched

The conservator's first rule. FyPDF never overwrites your original — every operation produces a separate output you can compare side-by-side with the source, accept, throw out, or re-run with different settings. The original file stays exactly where you put it.

Spec: Output is a separate file · Source preserved unmodified · Versioned naming for each pass.

Tag · Promise 02 · C/2026/0002

Every diagnosis is transparent

Each operation produces a written diagnostic alongside the file: OCR records per-page confidence and the regions it couldn't resolve; Repair lists every structural fault and what was reconstructed vs lost; Archive validates against the chosen PDF/A level and records any non-conformance; Flatten counts the layers, fields, and annotations it baked.

Spec: Per-operation receipt · Confidence scoring · Damage report · Conformance log.

Tag · Promise 03 · C/2026/0003

Standards are the standard

PDF/A conformance is real conformance — A-1 (long-term basics), A-2 (with layers / transparency), A-3 (with embedded source files). The output is validated with veraPDF before it's released. If it doesn't pass, the engine tells you what failed and offers a downgrade or repair pass instead of releasing a 'PDF/A-ish' result.

Spec: ISO 19005 conformance · veraPDF post-validation · Failure surfaced before release.

Tag · Promise 04 · C/2026/0004

Reversible where possible

OCR adds a text layer; the visible page is unchanged, and the original PDF stays available as the untouched source file. Archive operations can be re-run with different settings. Repair recovers what it can. Flatten is irreversible by definition, so the conservator keeps the source and the layered original is never lost to a mis-applied flatten.

Spec: OCR / archive: additive, source-recoverable · Repair: best-effort, logged · Flatten: terminal, source preserved.

The Tool Spreads

Read each operation like a lab note.

What each operation preserves, transforms, and where it pairs back into the rest of the suite. Four spreads, in order.

Spread · Scanned PDF → Searchable PDF

OCR PDF

Add a real, selectable text layer to a scanned PDF without touching the visible page image. The recogniser handles 100+ languages with per-page detection and a manual override panel for hand-written or specialty regions. Output is a hybrid PDF — the visible scan stays exactly as you uploaded it, the recognised text rides invisibly beneath so search engines and reader search bars can find words.

Preserves

Original scan image at upload resolution
Page count and order
Bookmarks and annotations from the source
Embedded metadata (XMP, where present)

Transforms

Per-page language detection · 100+ languages
Confidence scoring on every recognised region
Manual override panel for low-confidence areas
Hybrid PDF output (image visible, text searchable)

Formats: .pdf in (scan) · .pdf out (hybrid) · 100+ languages

Open OCR PDF

Pairs with: Convert · Organize · Edit

Figure: confidence map · page 04 / 12 · avg 94% · 100+ langs. The scan layer (pixels) sits above the invisible text layer. Legend: ≥95% · 85–94% · <85% (review). Sample glyph confidences: A 99, n 98, n 97, u 96, a 95, l 91, · 100, R 76, e 88, p 94.
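
The three legend bands of the confidence map (≥95%, 85–94%, <85% for review) can be sketched as a small classifier. This is only an illustration in Python: the function names `band` and `summarise` and the band labels are hypothetical, not FyPDF's API, and the sample values are the glyph confidences shown in the map.

```python
from collections import Counter

# Glyph confidences (percent) as shown on the confidence map for page 04.
SAMPLE = [("A", 99), ("n", 98), ("n", 97), ("u", 96), ("a", 95),
          ("l", 91), ("·", 100), ("R", 76), ("e", 88), ("p", 94)]


def band(confidence: float) -> str:
    """Map a confidence percentage to one of the three legend bands."""
    if confidence >= 95:
        return "high"    # legend: >=95%
    if confidence >= 85:
        return "medium"  # legend: 85-94%
    return "review"      # legend: <85%, flagged for manual review


def summarise(glyphs):
    """Count glyphs per band and list the glyphs that need manual review."""
    counts = Counter(band(conf) for _, conf in glyphs)
    flagged = [glyph for glyph, conf in glyphs if band(conf) == "review"]
    return counts, flagged


counts, flagged = summarise(SAMPLE)
print(dict(counts), flagged)
```

On the sample above, only the glyph "R" (76%) falls into the review band, which is the kind of region the manual override panel is meant for.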

Spread · Corrupted PDF → Recovered PDF

Repair PDF

Recover content from corrupted or partially-uploaded PDFs. The repair engine walks the file looking for valid object streams, rebuilds the cross-reference table from what it finds, reconstructs a valid trailer, and emits the recovered document with a written report listing what was reconstructed and what was beyond recovery.

Preserves

Recoverable page content and structure
Salvageable bookmarks and annotations
Embedded fonts where the font tables are intact
Source file always retained alongside

Transforms

Cross-reference table rebuilt from valid objects
Trailer reconstructed
Damaged object streams recovered or flagged as lost
Damage report emitted alongside the output

Formats: .pdf in (corrupt) · .pdf out (recovered) · damage report

Open Repair PDF

Pairs with: Organize · Edit · Security

xref table · before / after · obj 1 / 6

OBJ   OFFSET       STATE
0001  00000000018  intact
0002  00000000142  intact
0003  0000??????   rebuilt
0004  00000000884  intact
0005  0000??????   lost
0006  00000001220  intact

4 intact · 1 rebuilt · 1 lost

xref reconstructed · damage logged

Spread · PDF → PDF/A-1 / 2 / 3

PDF to PDF/A

Convert any PDF into ISO 19005 conformance for long-term archive. Choose your level — A-1 (basic, no layers), A-2 (with layers and transparency), or A-3 (with embedded source files for round-trippable archives). The output is validated with veraPDF before release; if conformance fails, the conservator surfaces the reason instead of releasing a 'PDF/A-ish' file.

Preserves

Document content, structure, and page order
Existing fonts (re-embedded as full subsets)
Tagged structure tree (PDF/UA accessibility)
Cross-references, hyperlinks, outlines

Transforms

Fonts → fully embedded subsets
Colour spaces → ICC profiles attached
External dependencies → removed or embedded
veraPDF post-validation against chosen level

Formats: .pdf in · .pdf/a out · A-1 / A-2 / A-3

Open PDF to PDF/A

Pairs with: Convert · Office · Security

conformance · level A-2u · verified

Fonts embedded as subsets
ICC colour profile attached
External hyperlinks stripped
Encryption removed
JavaScript stripped
Tag tree validated
veraPDF pass · A-2u

veraPDF

PASS · 2026-05-03

ISO 19005 · veraPDF validated

Spread · Layered PDF → Single content stream

Flatten PDF

Collapse layers, form fields, and annotations into the base content stream — useful when the downstream pipeline can't process layered PDFs (legal evidence handoffs, print-ready masters, ingestion into systems that ignore layers). The flatten is terminal by design; the source PDF is always preserved alongside so the layered original isn't lost.

Preserves

Source PDF preserved unmodified
Visible appearance pixel-identical post-flatten
Hyperlinks and outline destinations
Embedded fonts and resources

Transforms

Optional content groups (layers) → single layer
Form fields → static page content
Annotations and comments → baked into the page
Receipt records what was flattened

Formats: .pdf in (layered) · .pdf out (flat) · receipt log

Open Flatten PDF

Pairs with: Organize · Optimize · Security

layer stack · before · 6 layers

💬 Comments ×12
▭ Form fields ×8
✎ Annotations ×23
▣ OCG layer 2 ×1
▢ OCG layer 1 ×1
▥ Base content ×1

after · 1 layer · single content stream

N layers → 1 content stream

The OCR Confidence Atlas

Honest accuracy bands, by script.

Every OCR engine claims "100+ languages". Most don't tell you what accuracy you actually get. The atlas does — script family by script family, the confidence band on clean printed input plus a conservator's note on the hard cases.

Script Atlas · Recogniser v4.2 · 8 script families · clean printed input

Script family · Examples · Confidence band · Caveat

01 Latin
Examples: English, Spanish, French, German, Portuguese, Italian, +3
Confidence band: 96–99%
Caveat: Cleanest input on the planet. Printed Latin scripts in modern fonts top out around 99% confidence.

02 Cyrillic
Examples: Russian, Ukrainian, Bulgarian, Serbian, Macedonian
Confidence band: 94–98%
Caveat: Pre-Soviet typography drops a few points; modern print is essentially Latin-grade.

03 Greek
Examples: Modern Greek, Polytonic Greek (with diacritics)
Confidence band: 93–97%
Caveat: Polytonic diacritics on classical texts shave a few percent; the manual override panel handles edge glyphs.

04 CJK
Examples: Simplified Chinese, Traditional Chinese, Japanese, Korean
Confidence band: 90–96%
Caveat: Vertical layouts and kerning-rich typesetting need a per-page model; mixed Han + kana scores towards the high end.

05 Arabic & Hebrew
Examples: Arabic, Persian, Urdu, Hebrew, Yiddish
Confidence band: 88–95%
Caveat: Right-to-left layout and connected glyphs are handled at the script level, so line direction never gets crossed in the output.

06 Devanagari & related
Examples: Hindi, Marathi, Sanskrit, Nepali, Bengali, Tamil
Confidence band: 86–94%
Caveat: Conjunct ligatures pull the band a few points lower than Latin; modern Unicode fonts score at the top.

07 Handwriting (any script)
Examples: Print hand, Cursive, Field notebooks
Confidence band: 60–85%
Caveat: Wide variance. Neat print lands in the mid-80s, freeform cursive in the 60s. Manual review recommended.

08 Old / pre-1900 typography
Examples: Fraktur, Old English print, Long-s manuscripts
Confidence band: 70–88%
Caveat: Recogniser ships pre-1900 type models for major languages; results improve sharply with high-DPI source scans.

All bands measured on 600 dpi clean print · field scans variable
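
One way to use the atlas is to compare a measured page confidence against the published band for its script family. The sketch below is an illustration only: the `BANDS` table copies the figures above, while `compare_to_band` and its return labels are hypothetical names, not part of FyPDF.

```python
# Confidence bands (min %, max %) copied from the script atlas above.
BANDS = {
    "Latin": (96, 99),
    "Cyrillic": (94, 98),
    "Greek": (93, 97),
    "CJK": (90, 96),
    "Arabic & Hebrew": (88, 95),
    "Devanagari & related": (86, 94),
    "Handwriting": (60, 85),
    "Old / pre-1900 typography": (70, 88),
}


def compare_to_band(script: str, measured: float) -> str:
    """Say where a measured confidence sits relative to the clean-print band."""
    low, high = BANDS[script]
    if measured < low:
        return "below band"   # typical for field scans; consider manual review
    if measured > high:
        return "above band"
    return "within band"


print(compare_to_band("Latin", 96))     # within band
print(compare_to_band("Cyrillic", 91))  # below band
```

Because the bands are measured on clean 600 dpi print, a field scan landing below its band is expected rather than a fault; the comparison is only a prompt for review.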

PDF/A Conformance Matrix

Three levels, every feature accounted for.

Regulators don't ask whether a file is "PDF/A"; they ask which level. The matrix names every feature the spec touches and where each level lands. Tick means required or supported at this level; dash means not part of this level; black square means forbidden by the spec.

PDF/A-1

ISO 19005-1 (2005)

The strictest archival level. No layers, no transparency, no embedded files. Built for long-term predictability.

Variants

A-1b (basic visual)
A-1a (with accessibility tags)

PDF/A-2

ISO 19005-2 (2011)

Adds layers, transparency, JPEG2000 compression, and digital signatures. The pragmatic mid-tier.

Variants

A-2b (basic)
A-2u (Unicode mapping)
A-2a (accessibility tags)

PDF/A-3

ISO 19005-3 (2012)

Same feature set as A-2 plus embedded source files — the round-trippable archive that keeps the .docx alongside the PDF/A.

Variants

A-3b
A-3u
A-3a
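
The three level descriptions above imply a simple selection rule: pick the lowest part whose feature set covers what the document needs. The sketch below is a hedged illustration of that rule in Python; `minimum_pdfa_part` is a hypothetical helper, it follows only the feature summary given here, and it does not choose the conformance variant (b, u, or a).

```python
def minimum_pdfa_part(layers: bool = False,
                      transparency: bool = False,
                      jpeg2000: bool = False,
                      embedded_source_files: bool = False) -> int:
    """Return the lowest PDF/A part (1, 2 or 3) that allows the given features.

    Based on the level summaries above: A-1 allows none of these features,
    A-2 adds layers, transparency and JPEG2000, A-3 adds embedded source files.
    """
    if embedded_source_files:
        return 3
    if layers or transparency or jpeg2000:
        return 2
    return 1


print(minimum_pdfa_part())                            # 1
print(minimum_pdfa_part(transparency=True))           # 2
print(minimum_pdfa_part(embedded_source_files=True))  # 3
```

Choosing the lowest sufficient part is an assumption made for the example; a regulator may mandate a specific level regardless of the features a file uses.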

Feature matrix · columns: PDF/A-1 (ISO 19005-1, 2005) · PDF/A-2 (ISO 19005-2, 2011) · PDF/A-3 (ISO 19005-3, 2012) · Notes

Embedded fonts (full subsets): Required at every level; no external font dependencies.

ICC colour profile attached: Mandatory if any colour space is used.

External hyperlinks (web URLs): Stripped on archive; links don't survive 30 years of internet rot.

Optional content groups (layers): PDF/A-1 forbids them; A-2 and A-3 allow.

Transparency and blending modes: PDF/A-1 forbids; A-2 introduces controlled transparency.

JPEG2000 image compression: A-1 limits to baseline JPEG; A-2 adds JPEG2000.

Digital signatures (PAdES): A-1 doesn't profile signatures; A-2 adds embedded signature support.

Embedded source files (.docx, .xlsx): A-3 is the only level that allows embedding the original, which is useful for round-trippable archives.

Multimedia (video / audio): Forbidden at every PDF/A level; archives are static documents.

JavaScript / interactive forms: Stripped on archive; interactive content can't be guaranteed to render in 2055.

Encryption / passwords: Forbidden; archives must remain readable indefinitely.

Accessibility tag tree (PDF/UA): Required at the 'a' sub-level (1a, 2a, 3a) for accessibility-grade archives.

Legend: Required / supported · Not in this level · Forbidden by spec

veraPDF · post-validated on every output

The Damage Catalogue

Six ways a PDF breaks and what we do about it.

Most "PDF repair" tools are a marketing word for "guessing". The conservator names the damage by type, explains the cause, and is honest about recovery odds before the operation runs.

Type 01 · Malformed cross-reference table · Recovery likelihood: high

Symptom: Adobe refuses to open · Preview shows blank pages · 'damaged file' dialog

Cause: The xref table that maps object IDs to byte offsets got corrupted, usually by a partial download or a mid-write storage failure.

Recovery: Walks the file looking for valid 'obj … endobj' boundaries, rebuilds the cross-reference table from intact objects, and writes a clean trailer.
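
The recovery step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not FyPDF's engine: `scan_objects` and `build_xref` are hypothetical names, the scan is a plain regular-expression search that could be fooled by binary stream data, and missing objects are simply marked free without building a proper free list.

```python
import re

# Matches an object header such as "12 0 obj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def scan_objects(data: bytes) -> dict:
    """Return {object number: (byte offset, generation)} for complete objects.

    An object counts as complete only if 'endobj' appears before the next
    object header. A later definition of the same number replaces an earlier one.
    """
    headers = list(OBJ_RE.finditer(data))
    found = {}
    for i, match in enumerate(headers):
        limit = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        if data.find(b"endobj", match.end(), limit) == -1:
            continue  # header without a matching endobj: treat as lost
        found[int(match.group(1))] = (match.start(), int(match.group(2)))
    return found


def build_xref(found: dict) -> bytes:
    """Render a classic xref section; each entry is 20 bytes including newline."""
    size = max(found) + 1 if found else 1
    lines = [b"xref", b"0 %d" % size, b"0000000000 65535 f "]
    for num in range(1, size):
        if num in found:
            offset, gen = found[num]
            lines.append(b"%010d %05d n " % (offset, gen))
        else:
            lines.append(b"0000000000 65535 f ")  # missing object, marked free
    return b"\n".join(lines) + b"\n"


SAMPLE = (b"%PDF-1.4\n"
          b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
          b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
          b"4 0 obj\n<< /Length 10 >>\nstream\ntrunc")  # cut off mid-object

found = scan_objects(SAMPLE)
xref = build_xref(found)
print(xref.decode("ascii"))
```

In the sample, objects 1 and 2 are indexed and the truncated object 4 is skipped, which is the kind of item a damage report would list as lost.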

Type 02 · Missing or broken trailer · Recovery likelihood: high

Symptom: Reader finds the file structure but can't locate the catalog · 'no document root'

Cause: The trailer at the end of the file (the index of indices) is missing or malformed, which is common after a truncated transfer.

Recovery: Reconstructs a minimal valid trailer pointing at the recovered xref. Defaults to the last valid catalog object found in the body.
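
The "last valid catalog" default can be illustrated with a short Python sketch. Everything named here is hypothetical (`last_catalog`, `minimal_trailer`), the object scan is a simple regular expression that does not guard against truncated objects, and the trailer carries only the /Size and /Root keys.

```python
import re

# A complete object: header, body, 'endobj'.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
CATALOG_RE = re.compile(rb"/Type\s*/Catalog\b")


def last_catalog(data: bytes):
    """Return (object number, generation) of the last catalog object, or None."""
    ref = None
    for match in OBJ_RE.finditer(data):
        if CATALOG_RE.search(match.group(3)):
            ref = (int(match.group(1)), int(match.group(2)))
    return ref


def minimal_trailer(root, size: int, xref_offset: int) -> bytes:
    """Build a minimal trailer that points /Root at the chosen catalog."""
    num, gen = root
    return (b"trailer\n<< /Size %d /Root %d %d R >>\nstartxref\n%d\n%%%%EOF\n"
            % (size, num, gen, xref_offset))


SAMPLE = (b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
          b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
          b"5 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n")

root = last_catalog(SAMPLE)
trailer = minimal_trailer(root, size=6, xref_offset=len(SAMPLE))
print(trailer.decode("ascii"))
```

Preferring the last catalog mirrors how incremental updates append newer objects to the end of a file; whether that is the right choice for a given damaged file is a judgment call.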

Type 03 · Damaged object stream · Recovery likelihood: medium

Symptom: Specific pages or images render blank · 'cannot decompress object 24 0' errors

Cause: A compressed object stream got partially corrupted; typically Flate decompression fails midway through.

Recovery: Recovers as much of the stream as decompresses cleanly; flags the unrecoverable portion in the damage report and emits a placeholder for the affected page or resource.

Type 04 · Partial / truncated download · Recovery likelihood: medium

Symptom: File opens but ends mid-document · last few pages missing or unrenderable

Cause: Network transfer cut off before the final EOF marker. The file has no trailer, the xref is incomplete, and the last object is partially written.

Recovery: Rebuilds the xref from intact objects up to the corruption boundary. Recovered output has the pages that survived; lost pages are listed in the report.
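
The three signs of truncation named above (no EOF marker, no trailer pointer, a half-written last object) can be checked with a rough Python sketch. The function names `tail_report` and `looks_truncated` are hypothetical, the 1024-byte tail window is an assumption, and the checks are plain byte searches rather than a real parse.

```python
def tail_report(data: bytes, window: int = 1024) -> dict:
    """Check the end of a PDF for the markers a complete file should carry."""
    tail = data[-window:]
    return {
        # A complete file ends with an end-of-file marker.
        "has_eof_marker": b"%%EOF" in tail,
        # 'startxref' records where the cross-reference section begins.
        "has_startxref": b"startxref" in tail,
        # The last object header should be followed by an 'endobj'.
        "last_object_closed": data.rfind(b"endobj") > data.rfind(b" obj"),
    }


def looks_truncated(data: bytes) -> bool:
    """True if any of the three completeness checks fails."""
    return not all(tail_report(data).values())


COMPLETE = (b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
            b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
            b"trailer\n<< /Size 2 >>\nstartxref\n30\n%%EOF\n")
TRUNCATED = (b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
             b"2 0 obj\n<< /Length 99 >>\nstream\nabc")

print(looks_truncated(COMPLETE), looks_truncated(TRUNCATED))
```

A check like this only decides whether repair is worth attempting; the recovery itself still has to rebuild the xref from the objects that survived.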

Type 05 · Corrupted font subset table · Recovery likelihood: medium

Symptom: Pages render but text shows as rectangles or wrong glyphs · 'Cannot extract font' warning

Cause: Embedded font subset table got truncated or partially overwritten. Glyph metrics survive but the encoding map is broken.

Recovery: Attempts to reconstruct the encoding from the ToUnicode CMap if intact. If not, falls back to substitute fonts at matching metrics and flags the loss.

Type 06 · Encryption header damage · Recovery likelihood: low

Symptom: Reader prompts for password but won't accept any · 'invalid security handler'

Cause: The encryption dictionary is corrupted: the cipher and key length are present but the permission flags or owner-key hash are malformed.

Recovery: If the user supplies the correct owner password, reconstructs the encryption dictionary and emits an unlocked output. Without the password, the operation refuses.

Recovery likelihood: High recovery · Partial recovery · Low recovery

Receipt Anatomy

Every operation leaves a written record.

Conservation work without a receipt is just opinion. Every Advanced operation emits a signed diagnostic receipt that names what was done and proves the output corresponds to the input.

FyPDF · Diagnostic Receipt · #4F-21-D8

Issued · 2026-05-03 · Conservator session 6f

1 · OPERATION · OCR_PDF · v4.2.1 · 2026-05-03T09:14:22Z

2 · SOURCE · sha256: 3F4A…7C12 · 4,218,624 bytes

3 · OUTPUT · sha256: 7B91…F034 · 4,612,940 bytes

4 · FINDINGS · 12 pages · 11 above 95% confidence · 1 page (p. 8) at 76%, manual review suggested

5 · ACTIONS · text-layer added (12 pages) · language detect (eng/auto) · ToUnicode map embedded

6 · CONFORMANCE · n/a (OCR is additive) · original veraPDF state preserved

7 · SIGNATURE · ed25519: 9A4F…D8C2 · session: 6f-2026-05-03

END OF RECEIPT · 7 blocks · ed25519

Operation header

Names the conservation operation, the engine version, and the timestamp. The receipt is uniquely indexed for audit traceback.

Source fingerprint

SHA-256 of the input file at the moment the operation started — proof that the receipt corresponds to the document the conservator received.

Output fingerprint

SHA-256 of the emitted file — pin the result to its receipt for downstream verification.
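
For downstream verification, the source and output fingerprints can be recomputed independently with any SHA-256 implementation. The sketch below uses Python's standard `hashlib`; the helper names `fingerprint` and `short`, and the assumption that the receipt abbreviates the digest to its first and last four uppercase hex characters, are illustrative guesses based on the sample receipt.

```python
import hashlib
import io


def fingerprint(stream, chunk_size: int = 1 << 20):
    """Return (uppercase SHA-256 hex digest, byte count) for a binary stream."""
    digest = hashlib.sha256()
    size = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
        size += len(chunk)
    return digest.hexdigest().upper(), size


def short(hex_digest: str) -> str:
    """Abbreviate a digest in the receipt's style: first 4 … last 4 characters."""
    return f"{hex_digest[:4]}…{hex_digest[-4:]}"


# Demo on an in-memory stream; for a real file use open(path, "rb").
digest, size = fingerprint(io.BytesIO(b"abc"))
print(short(digest), size)
```

Comparing the recomputed digest and byte count with the SOURCE and OUTPUT blocks confirms that a file on disk is the one the receipt describes; the abbreviated form alone is too short to serve as proof.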

Findings

What the conservator found in the source. OCR records per-page confidence; Repair lists every fault detected; Archive enumerates non-conformant attributes; Flatten counts the layers and forms encountered.

Actions

What the conservator did. Each action is logged separately so the audit trail is itemised — useful when a downstream reader asks 'why does the output differ from the source?'

Conformance

Where the spec applies, the receipt records the validation result. PDF/A operations carry a veraPDF result; OCR operations carry a per-language confidence band.

Signature

Receipt is signed with the engine's session key. Any tampering with the receipt body invalidates the signature and surfaces in the audit pipeline.

Before · After

What the bench actually returns.

Three real specimens that come through the conservation lab. Condition on intake on the left, condition on release on the right, treatment record at the top.

Treatment record: Scanned dossier → searchable archive · Operation: OCR PDF

Intake condition: received

Field-scanned dossier (.pdf)

180 pages of scanned correspondence, photographs of pages, mixed languages, no text layer. Search returns nothing for any phrase.

Release condition: treated

Hybrid OCR'd PDF (.pdf)

Same visible scan, with a recognised text layer underneath. 96% confidence on Latin, 91% on the Cyrillic appendix. Search finds names, dates, and clause references in milliseconds.

Takeaway: Becomes a real archive; researchers can search instead of scrolling, and citations link to actual phrases.

Treatment record: Corrupted file → recovered master · Operation: Repair PDF

Intake condition: received

Half-uploaded contract (.pdf)

Partial transfer left the file with a malformed cross-reference table and a missing trailer. Adobe refuses to open it; Preview shows blank pages.

Release condition: treated

Recovered PDF + damage report (.pdf)

xref table rebuilt from intact objects; 142 of 146 objects recovered; trailer reconstructed; 4 objects flagged as lost beyond recovery in the report.

Takeaway: Recovers the document the storage failure tried to take, with an honest written report of what couldn't come back.

Treatment record: Live PDF → audit-grade archive · Operation: PDF to PDF/A

Intake condition: received

Working PDF (.pdf)

Hyperlinks to external pages, web fonts referenced (not embedded), CMYK colour without an ICC profile. Fine to read today; will degrade over decades.

Release condition: treated

PDF/A-2u archival master (.pdf)

Fonts fully embedded as subsets, ICC profile attached, external references either embedded or stripped, veraPDF validation passed against PDF/A-2u.

Takeaway: Stable for thirty years, with the same content, the same appearance, and no external dependencies that can rot.

Who works the conservation lab

Five regulars at the conservator's bench.

The personas who reach for conservation work weekly — and the specific operations they run.

Persona 01 · The archivist

Forty-year masters, audit-grade conformance

Public-record holdings, regulatory filings, library archives — anything that needs to outlive the software it was authored in. PDF/A conformance is the line; veraPDF validation is the receipt.

Reaches for

PDF to PDF/A · PDF/A-2u · veraPDF validated
OCR PDF · Pre-archive · scans get a text layer

Persona 02 · The records officer

FOIA / RTI hand-offs that pass scrutiny

Disclosures need to arrive searchable, conformant, and beyond reasonable doubt about provenance. OCR'd, archived, flattened — every step on a written receipt.

Reaches for

OCR PDF · Scans → searchable disclosure copy
Flatten PDF · Annotations baked · pre-handoff master

Persona 03 · The historian

Manuscripts that respect the original page

Photographed pages of nineteenth-century correspondence get a recognised text layer for research search — without the visible page being touched. The scan stays the scholarship; the OCR enables the index.

Reaches for

OCR PDF · Hybrid PDF · scan visible · text searchable
Repair PDF · Recover legacy archive masters

Persona 04 · The compliance lead

Regulator wants PDF/A, not 'a PDF'

Annual filings, regulatory exhibits, audit deliverables — the spec says PDF/A and the spec means it. veraPDF validation is the only proof that's not vibes.

Reaches for

PDF to PDF/A · Per-filing · level mandated by regulator
Flatten PDF · Pre-archive · forms baked · layers settled

Persona 05 · The data-pipeline engineer

Predictable inputs, no hidden layers

Downstream ingestion can't handle layered PDFs, won't render forms, and chokes on encryption. Flatten before ship; OCR for any scan that arrives in the queue. Receipts feed the audit trail.

Reaches for

Flatten PDF · Layers + forms + annotations → 1 layer
OCR PDF · Scans → searchable for the indexing pass

Common Questions

Before you send anything to the bench, a few honest answers.

Question Index

Q01 · 01 / 07

What does OCR do to my scanned PDFs?

FyPDF's OCR adds a recognised text layer to a scanned PDF without modifying the visible page image. The result is a "hybrid" PDF: when a reader opens it they see the original scan exactly as you uploaded it, but search bars and indexing engines can find words because the recognised text rides invisibly beneath. The recogniser handles 100+ languages with per-page detection and a manual override panel for low-confidence regions.

Conservation Reference · 01

7 questions in the conservation FAQ · Issue 07 · Advanced

Send to the lab

The bench is set. Tell it which station to run.

Drop the file, pick a station, take the result with its receipt. OCR for scans. Repair for damage. PDF/A for archive. Flatten for downstream pipelines.

Open the conservation lab · Tour every track in detail

Lab dispatch · 4 stations

Issue 07

OCR PDF · Read · 100+ langs

Repair PDF · Restore · damage report

PDF to PDF/A · Archive · veraPDF

Flatten PDF · Settle · 1 layer

All stations · written diagnostic · receipt issued

One Suite · Seven Tracks · Twenty-eight Tools and Counting · Start with the surface →