doc (HWPF) Overview

HWPF provides early support for the legacy Word 97-2003 .doc format. Coverage is ~25%.

Status

HWPF is useful today for text extraction, simple indexing and migration workflows, and limited main-body edits. It can open OLE2 .doc files; parse the File Information Block (FIB) and the 0Table/1Table table stream; extract main document text from the CLX/piece table; extract table and header/footer text; expose a minimal Range/Paragraph/CharacterRun model; and preserve unedited OLE streams and storages during no-op or limited body edits.

It is not a complete Word binary editing engine. Images, footnotes, comments, fields, and tracked changes are not modeled through public APIs.

Basic Text Extraction

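using System;
using System.IO;
using DotnetPoi.HWPF.UserModel;

// Open the legacy .doc file read-only and parse it.
using var stream = File.OpenRead("input.doc");
using var doc = new HWPFDocument(stream);

// Print the extracted document text.
Console.WriteLine(doc.getText());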
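
For indexing workflows, the extracted text usually needs to be split into paragraphs first. The following is a minimal sketch, not part of dotnet-poi: `ExtractedText` and `ToParagraphs` are hypothetical names, and it assumes the string returned by `getText()` may still contain Word's raw control marks (`\r` as the paragraph mark, `\u0007` as the table cell/row mark). If the library already normalizes these, the helper is harmless but unnecessary.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not a dotnet-poi type.
public static class ExtractedText
{
    // Splits raw extracted text into trimmed, non-empty paragraphs.
    public static List<string> ToParagraphs(string text)
    {
        var result = new List<string>();

        // Assumption: '\r' is a paragraph mark and '\u0007' a cell/row mark.
        foreach (var part in text.Split(new[] { '\r', '\n', '\u0007' }))
        {
            var sb = new StringBuilder(part.Length);
            foreach (char c in part)
            {
                // Keep tabs; drop any other control character.
                if (c == '\t' || !char.IsControl(c))
                {
                    sb.Append(c);
                }
            }

            var cleaned = sb.ToString().Trim();
            if (cleaned.Length > 0)
            {
                result.Add(cleaned);
            }
        }

        return result;
    }
}
```

With the document opened as shown above, `ExtractedText.ToParagraphs(doc.getText())` would yield one string per paragraph or table cell, ready to feed into an indexer.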

Limited Body Editing

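using System.IO;
using DotnetPoi.HWPF.UserModel;

// Open the source document.
using var stream = File.OpenRead("input.doc");
using var doc = new HWPFDocument(stream);

// Append a paragraph to the main body, then replace placeholder text.
doc.appendParagraph("Added by dotnet-poi");
doc.replaceText("{{name}}", "Example Corp");

// Write the result to a new file, leaving the input untouched.
using var output = File.Create("output.doc");
doc.write(output);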

Supported Today

OLE2 .doc open and stream inventory
FIB / 0Table / 1Table parsing
Main body text extraction from CLX/piece table
Compressed and Unicode text pieces
Header/footer text and table structure extraction
Minimal Range / Paragraph / CharacterRun API
Selected CHPX character formatting: font name, size, bold, italic, underline, strike
Minimal paragraph property reading
No-op write preservation
Limited append paragraph and simple text replacement
OLE stream/storage and embedded OLE preservation
Bidirectional tests against Java POI for text extraction
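
The "Compressed and Unicode text pieces" item refers to the two text encodings a piece table entry can point at. The sketch below shows the distinction under the layout described in the public MS-DOC specification: bit 30 of the 32-bit `FcCompressed` value flags an 8-bit piece, whose text starts at half the recorded offset, while other pieces are UTF-16LE at the recorded offset. `PieceText` and `Decode` are hypothetical names, not dotnet-poi APIs, and Latin-1 is used as a simplification of the specification's 8-bit character mapping. It assumes .NET 5 or later for `Encoding.Latin1`.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not a dotnet-poi type.
public static class PieceText
{
    // Bit 30 marks a "compressed" piece (one byte per character).
    private const uint CompressedFlag = 0x40000000;
    // The low 30 bits hold the offset into the WordDocument stream.
    private const uint OffsetMask = 0x3FFFFFFF;

    public static string Decode(byte[] wordDocumentStream, uint fcCompressed, int charCount)
    {
        bool compressed = (fcCompressed & CompressedFlag) != 0;
        int fc = checked((int)(fcCompressed & OffsetMask));

        if (compressed)
        {
            // Compressed text lives at fc / 2, one byte per character.
            // Simplification: Latin-1 instead of the spec's exact mapping.
            return Encoding.Latin1.GetString(wordDocumentStream, fc / 2, charCount);
        }

        // Uncompressed text is UTF-16LE, two bytes per character.
        return Encoding.Unicode.GetString(wordDocumentStream, fc, charCount * 2);
    }
}
```

A reader walks the piece table in character-position order, calls a routine like this for each piece, and concatenates the results to obtain the main body text.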

Limitations

No image extraction or editing API
No footnote/endnote/comment model
No field/bookmark model
No full Word style inheritance or complete PAPX/CHPX expansion
Limited body edits rebuild the main body as a single Unicode piece; use cautiously for complex documents
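
To make the last limitation concrete, the sketch below shows what collapsing a multi-piece body into a single Unicode piece involves. It is an illustration only, not dotnet-poi's implementation: `TextPiece`, `SinglePiece`, and `Flatten` are hypothetical names, Latin-1 stands in for the 8-bit piece encoding, and it assumes C# 9 and .NET 5 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical shape for illustration; not a dotnet-poi type.
public sealed record TextPiece(byte[] Bytes, bool Compressed);

public static class SinglePiece
{
    // Decodes every piece, concatenates the text, and re-encodes it as one
    // UTF-16LE buffer that a single piece descriptor could point at.
    public static byte[] Flatten(IEnumerable<TextPiece> pieces, out int charCount)
    {
        var text = new StringBuilder();
        foreach (var piece in pieces)
        {
            text.Append(piece.Compressed
                ? Encoding.Latin1.GetString(piece.Bytes)    // one byte per character
                : Encoding.Unicode.GetString(piece.Bytes)); // two bytes per character
        }

        charCount = text.Length;
        return Encoding.Unicode.GetBytes(text.ToString());
    }
}
```

After flattening, the original piece boundaries are gone and formerly 8-bit text occupies twice as many bytes, so every byte offset into the text changes. Anything in the file keyed to the old offsets has to be remapped consistently, which is why documents with rich formatting deserve extra care.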

For new documents, prefer docx / XWPF.