If your Umbraco site needs to search the contents of Word documents, PowerPoint presentations, or Excel spreadsheets uploaded to the media library, I have just published a NuGet package that handles exactly that. Umbraco.Community.Examine.OpenXml uses the OpenXml SDK to extract text from Office documents and index them using Examine, making them fully searchable alongside the rest of your site content.
From GitHub Project to Proper Package
Back in July 2023 I wrote a post about indexing Microsoft Office documents in Umbraco 10 using OpenXML. At the time, the solution was a GitHub repository you had to download and manually integrate into your own solution. It worked, but it meant carrying the code yourself, keeping it up to date, and wiring up the startup registration by hand.
I have now packaged it properly, published it to NuGet, and updated it to support Umbraco 13, 16, and 17. You can install it with a single command and have Office document indexing working in minutes, with no custom code required in your project.
Why This Gap Existed
By default Umbraco only indexes the name of media items, not their content. If you want the text inside uploaded files to appear in search results, you need to build a custom index and populate it when files are uploaded or updated.
What Was Available in the Umbraco 7 Era
In the Umbraco 7 days the main option was the Cogworks ExamineFileIndexer package, which used Apache Tika under the hood to extract text from .docx, .pdf, and other file types. It worked, but it required manually copying Tika DLLs into your bin folder and had a dependency on a Java-based library via IKVM, which made deployment more involved than most developers wanted. The last release supported Umbraco 7.14.x and it was never updated for modern Umbraco.
The Gap in Modern Umbraco
When Umbraco moved to the modern .NET architecture from version 9 onwards, the old Cogworks package was left behind. The official UmbracoExamine.PDF package covers PDF files well, but there has been no equivalent for Word, Excel, or PowerPoint documents using standard Lucene-based Examine. A GitHub issue raised against the Umbraco CMS repository as far back as 2018 explicitly requested an Office document equivalent of UmbracoExamine.PDF, and nothing has appeared from Umbraco HQ or the community since.
The one exception worth mentioning is ExamineX, which can index Office documents as part of a much larger proposition: it replaces Lucene entirely with Azure AI Search, requires Azure Blob Storage as the media provider, and carries a commercial licence at $1,000 per year per site. It is an excellent product for enterprise sites on Azure, but it is solving a different set of problems. If you are running a standard Umbraco site on Lucene and just want Office document search to work, that is significant infrastructure and cost overhead for what should be a straightforward feature.
Umbraco.Community.Examine.OpenXml fills that gap: a free, open-source, self-contained package that works with standard Lucene-based Examine and requires nothing beyond a single NuGet install.
How It Works
The package hooks into Umbraco's media events and uses the DocumentFormat.OpenXml SDK to extract text from Office documents when they are uploaded or updated. The extracted text is stored in a dedicated Lucene-based Examine index called OpenXmlIndex. The index stays in sync automatically as media items are created, updated, or deleted, and it can be manually rebuilt from the Examine Management dashboard in the backoffice.
The approach is based directly on UmbracoExamine.PDF. If you have used that package, this one will feel immediately familiar.
Supported File Types
| Extension | Type |
|---|---|
| .docx | Word documents |
| .pptx | PowerPoint presentations |
| .xlsx | Excel spreadsheets |
Supported Umbraco Versions
| Umbraco | .NET | Status |
|---|---|---|
| 13.x | .NET 8 | Supported |
| 14.x | .NET 8 | Not supported (EOL) |
| 15.x | .NET 9 | Not supported (EOL) |
| 16.x | .NET 9 | Supported |
| 17.x | .NET 10 | Supported |
Versions 14 and 15 have reached end of life and are not supported. If you are on either of those versions, now is a good time to plan your upgrade to 16 or 17.
Installation
Install the package from NuGet:
dotnet add package Umbraco.Community.Examine.OpenXml
Or via the Package Manager Console:
NuGet\Install-Package Umbraco.Community.Examine.OpenXml
That is genuinely all there is to it. The package registers itself automatically using an Umbraco composer, so there is no additional startup configuration required. Once you restart the site the OpenXmlIndex will be available immediately. Any Office documents already in your media library can be indexed by triggering a rebuild from the Examine Management dashboard in the backoffice.
Searching the Index
Once the index is populated you can query it from any Razor view or controller using Examine's IExamineManager. The package exposes an OpenXmlIndexConstants class with strongly-typed constants so you do not have to remember magic strings.
Basic Search Example
Here is a straightforward example in a Razor view:
@using Examine
@using Examine.Search
@using Umbraco.Community.Examine.OpenXml
@inject IExamineManager ExamineManager
@{
    var searchQuery = Context.Request.Query["q"].ToString();
}
@if (!string.IsNullOrWhiteSpace(searchQuery))
{
    if (ExamineManager.TryGetIndex(OpenXmlIndexConstants.OpenXmlIndexName, out var index))
    {
        var searcher = index.Searcher;
        var query = searcher.CreateQuery(OpenXmlIndexConstants.OpenXmlCategory)
            .GroupedOr(
                new[] { OpenXmlIndexConstants.OpenXmlContentFieldName, "nodeName" },
                searchQuery
            );
        var results = query.Execute();
        <p>Found @results.TotalItemCount result(s) for "@searchQuery"</p>
        foreach (var result in results)
        {
            var name = result.Values.ContainsKey("nodeName")
                ? result.Values["nodeName"]
                : "Unknown";
            <div>
                <h2>@name</h2>
                <p>Score: @result.Score.ToString("F2")</p>
            </div>
        }
    }
}
Available Constants
The OpenXmlIndexConstants class provides the following constants:
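| Constant | Value | Purpose |
|---|---|---|
| OpenXmlIndexName | OpenXmlIndex | Passed to IExamineManager.TryGetIndex |
| OpenXmlContentFieldName | | The field containing the extracted document text |
| OpenXmlCategory | | Scopes the query to OpenXml documents |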
Combining with Other Indexes
If you are also using UmbracoExamine.PDF alongside this package, you can create a multi-searcher to query both indexes at once with a single search call:
services.AddExamineLuceneMultiSearcher(
    "MediaSearcher",
    new[]
    {
        PdfIndexConstants.PdfIndexName,
        OpenXmlIndexConstants.OpenXmlIndexName
    }
);
You can then inject IExamineManager, retrieve the MediaSearcher, and search across PDF and Office documents in one query.
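As a rough sketch of that last step, a small service could retrieve the multi-searcher by name and query both content fields. The MediaSearchService class is hypothetical, and I am assuming that UmbracoExamine.PDF exposes PdfIndexConstants.PdfContentFieldName in the UmbracoExamine.PDF namespace; check both against the package versions you have installed.

```csharp
using System.Collections.Generic;
using System.Linq;
using Examine;
using Umbraco.Community.Examine.OpenXml;
using UmbracoExamine.PDF; // assumed namespace for PdfIndexConstants

// Hypothetical service for illustration; not part of either package.
public class MediaSearchService
{
    private readonly IExamineManager _examineManager;

    public MediaSearchService(IExamineManager examineManager)
        => _examineManager = examineManager;

    public IReadOnlyList<ISearchResult> Search(string term)
    {
        // "MediaSearcher" must match the name used when registering the multi-searcher.
        if (string.IsNullOrWhiteSpace(term) ||
            !_examineManager.TryGetSearcher("MediaSearcher", out var searcher))
        {
            return new List<ISearchResult>();
        }

        // No category is passed, so the query is not scoped to one document type.
        return searcher.CreateQuery()
            .GroupedOr(
                new[]
                {
                    OpenXmlIndexConstants.OpenXmlContentFieldName,
                    PdfIndexConstants.PdfContentFieldName, // assumed constant name
                    "nodeName"
                },
                term)
            .Execute()
            .ToList();
    }
}
```

Registering the service with dependency injection lets a controller or view component call Search without knowing which index a result came from.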
Extraction Limits
To protect against malicious or oversized documents, the package applies limits during text extraction. Documents that exceed any of these limits are logged as warnings and excluded from the index rather than causing errors.
| Limit | Value | Description |
|---|---|---|
| Max file size | 100 MB | Files exceeding this size are skipped entirely |
| Max extracted content | 10 MB | Text extraction stops once this limit is reached |
| Max characters per part | 10,000,000 | Limits characters per OpenXml document part to prevent decompression bombs |
| Max shared strings (Excel) | 1,000,000 | Caps the number of shared string entries loaded from an Excel workbook |
The Excel shared strings limit is worth calling out specifically. Excel stores cell text in a shared strings table rather than inline in each cell, so very large spreadsheets with a high volume of unique string values could otherwise consume a significant amount of memory during indexing. The cap keeps that under control without affecting typical real-world documents.
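To make the shared strings indirection concrete, here is a minimal sketch that uses only System.Xml.Linq rather than the OpenXml SDK. The SharedStringsSketch class and its method names are hypothetical, and the package's own implementation will differ; the point is only to show how a cell marked t="s" stores an index into the table, and how a cap on loaded entries bounds memory use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not the package's actual code.
public static class SharedStringsSketch
{
    // SpreadsheetML main namespace, used by both sharedStrings.xml and worksheet parts.
    private static readonly XNamespace S =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Loads at most maxEntries strings from the shared strings part (<sst>/<si>).
    public static List<string> LoadSharedStrings(XDocument sharedStringsXml, int maxEntries) =>
        sharedStringsXml.Descendants(S + "si")
            .Take(maxEntries)
            .Select(si => string.Concat(si.Descendants(S + "t").Select(t => t.Value)))
            .ToList();

    // Yields the text of every cell marked t="s" by looking up its index in the table.
    public static IEnumerable<string> ResolveCellText(
        XDocument sheetXml, IReadOnlyList<string> shared)
    {
        foreach (var cell in sheetXml.Descendants(S + "c"))
        {
            // Cells without t="s" hold inline values (numbers, dates) and are skipped here.
            if ((string?)cell.Attribute("t") != "s")
            {
                continue;
            }

            var raw = (string?)cell.Element(S + "v");
            if (int.TryParse(raw, out var index) && index >= 0 && index < shared.Count)
            {
                yield return shared[index];
            }
        }
    }
}
```

In this sketch, a cell pointing beyond the capped table is simply ignored, so an oversized workbook degrades to partial text rather than an error.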
All of these values are defined in the OpenXmlIndexConstants class, so if you need to reference them in your own code they are available there.
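The general shape of this kind of guard can be sketched in a few lines. Everything named here (Limits, ExtractionLimitsSketch, Extract) is hypothetical and the limits are passed in as parameters rather than read from OpenXmlIndexConstants; I am also treating the content cap as a character count for simplicity, which is an assumption rather than a description of how the package measures it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical limit container; the real values live in OpenXmlIndexConstants.
public sealed record Limits(long MaxFileSizeBytes, int MaxContentChars);

public static class ExtractionLimitsSketch
{
    // Returns null when the file is skipped, otherwise the (possibly truncated) text.
    public static string? Extract(
        long fileSizeBytes, IEnumerable<string> textChunks, Limits limits)
    {
        // Oversized files are rejected before any extraction work happens.
        if (fileSizeBytes > limits.MaxFileSizeBytes)
        {
            return null;
        }

        var builder = new StringBuilder();
        foreach (var chunk in textChunks)
        {
            var remaining = limits.MaxContentChars - builder.Length;
            if (remaining <= 0)
            {
                break; // content cap reached: stop extracting
            }

            // Append the whole chunk if it fits, otherwise only the part that does.
            builder.Append(chunk.Length <= remaining ? chunk : chunk.Substring(0, remaining));
        }

        return builder.ToString();
    }
}
```

Checking the file size first means a hostile upload is rejected before it is ever opened, while the running cap bounds memory even when a small file expands into a very large amount of text.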
Why OpenXml and Not Tika
One of the reasons I chose the OpenXml SDK over something like Apache Tika is that it is a pure .NET library with no external runtime dependencies. There is nothing to copy into your bin folder, no Java, and no additional infrastructure to manage. An OpenXML document is just a ZIP file containing XML parts, and the SDK gives you clean programmatic access to those parts. For Word documents you read the main document part, for PowerPoint you iterate the slide parts, and for Excel you read the shared strings table and cells across each worksheet. The extracted text is concatenated into a single string which Lucene then tokenises and indexes in the normal way.
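To illustrate the "ZIP file containing XML parts" point, the sketch below reads a Word document using nothing but System.IO.Compression and System.Xml.Linq. This is deliberately not how the package does it (the package uses the OpenXml SDK, which resolves parts through their relationships); DocxZipSketch is a hypothetical helper that assumes the main document part sits at the conventional word/document.xml path.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; the package itself uses the
// DocumentFormat.OpenXml SDK rather than reading the ZIP archive directly.
public static class DocxZipSketch
{
    // WordprocessingML main namespace; text runs are stored in <w:t> elements.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string ExtractText(Stream docx)
    {
        using var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true);

        // Assumption: the main document part lives at the conventional path.
        var entry = archive.GetEntry("word/document.xml");
        if (entry is null)
        {
            return string.Empty;
        }

        using var partStream = entry.Open();
        var xml = XDocument.Load(partStream);

        // One line per paragraph (<w:p>), built by joining its text runs (<w:t>).
        var paragraphs = xml.Descendants(W + "p")
            .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

        return string.Join(Environment.NewLine, paragraphs);
    }
}
```

Calling DocxZipSketch.ExtractText on a stream opened over a .docx file returns its paragraph text, one paragraph per line. The SDK adds typed access to slides, worksheets, and shared strings on top of the same underlying structure, which is why it is the more robust choice for a package.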
Source Code
The full source code is on GitHub at github.com/justin-nevitech/Umbraco.Community.Examine.OpenXml. If you have been using the original code from the 2023 post and maintaining it yourself in your own solution, I would encourage you to switch to the package. It is easier to keep up to date and means you get any bug fixes without having to carry the code yourself.
Issues and pull requests are welcome.