07 April 2026

If your Umbraco site needs to search the contents of Word documents, PowerPoint presentations, or Excel spreadsheets uploaded to the media library, I have just published a NuGet package that handles exactly that. Umbraco.Community.Examine.OpenXml uses the OpenXml SDK to extract text from Office documents and index them using Examine, making them fully searchable alongside the rest of your site content.

From GitHub Project to Proper Package

Back in July 2023 I wrote a post about indexing Microsoft Office documents in Umbraco 10 using OpenXML. At the time, the solution was a GitHub repository you had to download and manually integrate into your own solution. It worked, but it meant carrying the code yourself, keeping it up to date, and wiring up the startup registration by hand.

I have now packaged it properly, published it to NuGet, and updated it to support Umbraco 13, 16, and 17. You can install it with a single command and have Office document indexing working in minutes, with no custom code required in your project.

Why This Gap Existed

By default, Umbraco indexes only the name of a media item, not the content of the file behind it. If you want the text inside uploaded files to appear in search results, you need to build a custom index and populate it when files are uploaded or updated.

What Was Available in the Umbraco 7 Era

In the Umbraco 7 days the main option was the Cogworks ExamineFileIndexer package, which used Apache Tika under the hood to extract text from .docx, .pdf, and other file types. It worked, but it required manually copying Tika DLLs into your bin folder and had a dependency on a Java-based library via IKVM, which made deployment more involved than most developers wanted. The last release supported Umbraco 7.14.x and it was never updated for modern Umbraco.

The Gap in Modern Umbraco

When Umbraco moved to the modern .NET architecture from version 9 onwards, the old Cogworks package was left behind. The official UmbracoExamine.PDF package covers PDF files well, but there has been no equivalent for Word, Excel, or PowerPoint documents using standard Lucene-based Examine. A GitHub issue raised against the Umbraco CMS repository as far back as 2018 explicitly requested an Office document equivalent of UmbracoExamine.PDF, and nothing has appeared from Umbraco HQ or the community since.

The one exception worth mentioning is ExamineX, which can index Office documents as part of a much larger proposition: it replaces Lucene entirely with Azure AI Search, requires Azure Blob Storage as the media provider, and carries a commercial licence at $1,000 per year per site. It is an excellent product for enterprise sites on Azure, but it is solving a different set of problems. If you are running a standard Umbraco site on Lucene and just want Office document search to work, that is significant infrastructure and cost overhead for what should be a straightforward feature.

Umbraco.Community.Examine.OpenXml fills that gap: a free, open-source, self-contained package that works with standard Lucene-based Examine and requires nothing beyond a single NuGet install.

How It Works

The package hooks into Umbraco's media events and uses the DocumentFormat.OpenXml SDK to extract text from Office documents when they are uploaded or updated. The extracted text is stored in a dedicated Lucene-based Examine index called OpenXmlIndex. The index stays in sync automatically as media items are created, updated, or deleted, and it can be manually rebuilt from the Examine Management dashboard in the backoffice.

The approach is based directly on UmbracoExamine.PDF. If you have used that package, this one will feel immediately familiar.

Supported File Types

Extension	Type
`.docx`	Word documents
`.pptx`	PowerPoint presentations
`.xlsx`	Excel spreadsheets

Supported Umbraco Versions

Umbraco	.NET	Status
13.x	.NET 8	Supported
14.x	.NET 8	Not supported (EOL)
15.x	.NET 9	Not supported (EOL)
16.x	.NET 9	Supported
17.x	.NET 10	Supported

Versions 14 and 15 have reached end of life and are not supported. If you are on either of those versions, now is a good time to plan your upgrade to 16 or 17.

Installation

Install the package from NuGet:

dotnet add package Umbraco.Community.Examine.OpenXml

Or via the Package Manager Console:

NuGet\Install-Package Umbraco.Community.Examine.OpenXml

That is genuinely all there is to it. The package registers itself automatically using an Umbraco composer, so there is no additional startup configuration required. Once you restart the site the OpenXmlIndex will be available immediately. Any Office documents already in your media library can be indexed by triggering a rebuild from the Examine Management dashboard in the backoffice.

Searching the Index

Once the index is populated you can query it from any Razor view or controller using Examine's IExamineManager. The package exposes an OpenXmlIndexConstants class with strongly-typed constants so you do not have to remember magic strings.

Basic Search Example

Here is a straightforward example in a Razor view:

@using Examine
@using Examine.Search
@using Umbraco.Community.Examine.OpenXml

@inject IExamineManager ExamineManager

@{
    var searchQuery = Context.Request.Query["q"].ToString();
}

@if (!string.IsNullOrWhiteSpace(searchQuery))
{
    // Look up the index that the package registers at startup.
    if (ExamineManager.TryGetIndex(OpenXmlIndexConstants.OpenXmlIndexName, out var index))
    {
        var searcher = index.Searcher;

        // Match the search term against either the extracted document text or the media item name.
        var query = searcher.CreateQuery(OpenXmlIndexConstants.OpenXmlCategory)
            .GroupedOr(
                new[] { OpenXmlIndexConstants.OpenXmlContentFieldName, "nodeName" },
                searchQuery
            );

        var results = query.Execute();

        <p>Found @results.TotalItemCount result(s) for "@searchQuery"</p>

        foreach (var result in results)
        {
            var name = result.Values.ContainsKey("nodeName")
                ? result.Values["nodeName"]
                : "Unknown";

            <div>
                <h2>@name</h2>
                <p>Score: @result.Score.ToString("F2")</p>
            </div>
        }
    }
}

Available Constants

The OpenXmlIndexConstants class provides the following constants for querying the index:

Constant	Value	Purpose
`OpenXmlIndexName`	`"OpenXmlIndex"`	Passed to `TryGetIndex()`
`OpenXmlContentFieldName`	`"fileTextContent"`	The field containing the extracted document text
`OpenXmlCategory`	`"openxml"`	Scopes the query to OpenXml documents

Combining with Other Indexes

If you are also using UmbracoExamine.PDF alongside this package, you can register a multi-searcher at startup, wherever you have access to the service collection, and query both indexes with a single search call:

services.AddExamineLuceneMultiSearcher(
    "MediaSearcher",
    new[]
    {
        PdfIndexConstants.PdfIndexName,
        OpenXmlIndexConstants.OpenXmlIndexName
    }
);

You can then inject IExamineManager, retrieve the MediaSearcher, and search across PDF and Office documents in one query.
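
As a rough sketch of that last step, the class below looks up the "MediaSearcher" registered above and runs one query across both indexes. The class name MediaDocumentSearch is hypothetical, and the UmbracoExamine.PDF namespace and PdfIndexConstants.PdfContentFieldName are assumptions about the PDF package rather than something confirmed here, so check them against the version you have installed.

```csharp
using System.Collections.Generic;
using System.Linq;
using Examine;
using Umbraco.Community.Examine.OpenXml;
using UmbracoExamine.PDF; // assumed namespace for PdfIndexConstants

// Hypothetical helper that queries the "MediaSearcher" multi-searcher.
public class MediaDocumentSearch
{
    private readonly IExamineManager _examineManager;

    public MediaDocumentSearch(IExamineManager examineManager)
        => _examineManager = examineManager;

    public IReadOnlyList<ISearchResult> Search(string term)
    {
        // Return nothing for an empty term or if the multi-searcher is not registered.
        if (string.IsNullOrWhiteSpace(term) ||
            !_examineManager.TryGetSearcher("MediaSearcher", out var searcher))
        {
            return new List<ISearchResult>();
        }

        // No category is passed, so results from both the PDF and OpenXml indexes are kept.
        return searcher.CreateQuery()
            .GroupedOr(
                new[]
                {
                    OpenXmlIndexConstants.OpenXmlContentFieldName,
                    PdfIndexConstants.PdfContentFieldName, // assumed constant in UmbracoExamine.PDF
                    "nodeName"
                },
                term)
            .Execute()
            .ToList();
    }
}
```

The category argument is left out deliberately: scoping the query to OpenXmlIndexConstants.OpenXmlCategory, as in the basic example, would exclude the PDF results.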

Extraction Limits

To protect against malicious or oversized documents, the package applies limits during text extraction. Documents that exceed any of these limits are logged as warnings and excluded from the index rather than causing errors.

Limit	Value	Description
Max file size	100 MB	Files exceeding this size are skipped entirely
Max extracted content	10 MB	Text extraction stops once this limit is reached
Max characters per part	10,000,000	Limits characters per OpenXml document part to prevent decompression bombs
Max shared strings (Excel)	1,000,000	Caps the number of shared string entries loaded from `.xlsx` files
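
To illustrate the idea behind the extracted content limit, here is a minimal sketch of appending text against a character budget and stopping once it is spent. It is not the package's code: the function name TryAppendWithinBudget is hypothetical, the budget of 8 characters is only for demonstration, and whether the package keeps the truncated text or discards the document is not something this sketch tries to reproduce.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Append text parts until the budget is spent, then stop reading further parts.
// Returns false when the budget was hit, so the caller can log a warning.
bool TryAppendWithinBudget(IEnumerable<string> parts, int maxChars, StringBuilder target)
{
    foreach (var part in parts)
    {
        var remaining = maxChars - target.Length;
        if (part.Length > remaining)
        {
            target.Append(part, 0, remaining); // keep what fits, drop the rest
            return false;
        }
        target.Append(part);
    }
    return true;
}

var extracted = new StringBuilder();
var withinBudget = TryAppendWithinBudget(new[] { "alpha ", "beta ", "gamma" }, 8, extracted);
Console.WriteLine($"{extracted} (within budget: {withinBudget})"); // alpha be (within budget: False)
```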

The Excel shared strings limit is worth calling out specifically. Excel stores cell text in a shared strings table rather than inline in each cell, so very large spreadsheets with a high volume of unique string values could otherwise consume a significant amount of memory during indexing. The cap keeps that under control without affecting typical real-world documents.
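
The shared strings mechanism is easier to see in code. The sketch below parses two trimmed-down XML parts with System.Xml.Linq rather than the OpenXml SDK, caps the shared strings table, and resolves each cell. The cap of 2, the helper names, and the choice to skip cells whose index falls beyond the cap are all assumptions made for illustration, not the package's actual behaviour.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XNamespace ns = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

// Tiny cap for demonstration only; the package's documented limit is 1,000,000.
const int MaxSharedStrings = 2;

// Trimmed-down versions of the parts found inside an .xlsx archive.
var sharedStringsPart = XDocument.Parse(
    "<sst xmlns='http://schemas.openxmlformats.org/spreadsheetml/2006/main'>" +
    "<si><t>Name</t></si><si><t>Total</t></si><si><t>Ignored</t></si></sst>");
var worksheetPart = XDocument.Parse(
    "<worksheet xmlns='http://schemas.openxmlformats.org/spreadsheetml/2006/main'><sheetData><row>" +
    "<c r='A1' t='s'><v>0</v></c><c r='B1' t='s'><v>1</v></c>" +
    "<c r='C1'><v>42</v></c><c r='D1' t='s'><v>2</v></c>" +
    "</row></sheetData></worksheet>");

// Load the shared strings table, stopping at the cap so a huge table cannot exhaust memory.
List<string> LoadSharedStrings(XDocument part, int cap) =>
    part.Descendants(ns + "si")
        .Take(cap)
        .Select(si => string.Concat(si.Descendants(ns + "t").Select(t => t.Value)))
        .ToList();

// Resolve a cell: t="s" means the <v> value is an index into the shared strings table.
string ResolveCell(XElement cell, IReadOnlyList<string> strings)
{
    var raw = cell.Element(ns + "v")?.Value ?? string.Empty;
    if (cell.Attribute("t")?.Value != "s") return raw; // inline value, e.g. a number
    return int.TryParse(raw, out var index) && index >= 0 && index < strings.Count
        ? strings[index]
        : string.Empty; // index beyond the capped table: skip rather than fail
}

var sharedStrings = LoadSharedStrings(sharedStringsPart, MaxSharedStrings);
var sheetText = string.Join(" ", worksheetPart.Descendants(ns + "c")
    .Select(c => ResolveCell(c, sharedStrings))
    .Where(value => value.Length > 0));
Console.WriteLine(sheetText); // Name Total 42
```

Cell D1 points at the third shared string, which the cap never loaded, so it contributes nothing to the extracted text.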

All of these values are defined in the OpenXmlIndexConstants class, so you can reference them from your own code if you need to.

Why OpenXml and Not Tika

One of the reasons I chose the OpenXml SDK over something like Apache Tika is that it is a pure .NET library with no external runtime dependencies. There is nothing to copy into your bin folder, no Java, and no additional infrastructure to manage. An Open XML document is just a ZIP file containing XML parts, and the SDK gives you clean programmatic access to those parts. For Word documents you read the main document part, for PowerPoint you iterate the slide parts, and for Excel you read the shared strings table and cells across each worksheet. The extracted text is concatenated into a single string, which Lucene then tokenises and indexes in the normal way.
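
To make the "ZIP file containing XML parts" point concrete, here is a small sketch that uses only System.IO.Compression and System.Xml.Linq, not the OpenXml SDK that the package itself relies on. It builds a minimal in-memory archive holding just word/document.xml and reads the paragraph text back out. The helper names are hypothetical, and a real .docx carries additional parts (content types, relationships) that are omitted here.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// WordprocessingML namespace used by elements such as <w:p> (paragraph) and <w:t> (text run).
XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Build a tiny in-memory ZIP holding only word/document.xml (a simplified test fixture).
byte[] BuildSampleArchive()
{
    var documentXml = new XDocument(
        new XElement(w + "document",
            new XAttribute(XNamespace.Xmlns + "w", w.NamespaceName),
            new XElement(w + "body",
                new XElement(w + "p", new XElement(w + "r", new XElement(w + "t", "Hello"))),
                new XElement(w + "p", new XElement(w + "r", new XElement(w + "t", "Umbraco"))))));

    using var buffer = new MemoryStream();
    using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
    {
        var entry = zip.CreateEntry("word/document.xml");
        using var entryStream = entry.Open();
        documentXml.Save(entryStream);
    }
    return buffer.ToArray();
}

// Open the archive, load the main document part, and join the text of each paragraph.
string ExtractText(byte[] archiveBytes)
{
    using var zip = new ZipArchive(new MemoryStream(archiveBytes), ZipArchiveMode.Read);
    var part = zip.GetEntry("word/document.xml")
        ?? throw new InvalidDataException("Main document part not found.");
    using var partStream = part.Open();
    var xml = XDocument.Load(partStream);

    var paragraphs = xml.Descendants(w + "p")
        .Select(p => string.Concat(p.Descendants(w + "t").Select(t => t.Value)));
    return string.Join(" ", paragraphs);
}

var text = ExtractText(BuildSampleArchive());
Console.WriteLine(text); // Hello Umbraco
```

The SDK wraps this same structure in typed classes, which is what keeps the real extraction code short and free of hand-written XML handling.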

Source Code

The full source code is on GitHub at github.com/justin-nevitech/Umbraco.Community.Examine.OpenXml. If you have been using the original code from the 2023 post and maintaining it yourself in your own solution, I would encourage you to switch to the package. It is easier to keep up to date and means you get any bug fixes without having to carry the code yourself.

Issues and pull requests are welcome.

Indexing Word, Excel and PowerPoint Documents in Umbraco Using Examine