Introduction

This project actually started at a workshop with a client, where we discussed how to make a network share – containing documents such as Word, Excel and PDF files – searchable for employees, while at the same time respecting the file access control lists (ACLs) during search.

I'm one of those who still has a NAS at home, backing up the thousands of images we've produced over the years. There are also a bunch of documents that haven't made it into the cloud yet. Here we will be indexing PDF files on a NAS over the local network.

Prerequisites

You will need a running Elasticsearch instance with the ingest attachment processor plugin installed (bin/elasticsearch-plugin install ingest-attachment), and the NEST client package added to your project.

Document definition

The document type will look like this:

public class Document
{
    public Document()
    {
        Id = Guid.NewGuid();
        Acl = new List<string>();
    }

    public Guid Id { get; set; }
    public string Path { get; set; }
    public IList<string> Acl { get; set; }
    public Attachment Attachment { get; set; }

    public string Content { get; set; }
}

It contains a unique id for identification, the path where the physical file is located on disk, the access control list, the attachment that will be indexed, and a content property which temporarily holds the actual file content as a base64-encoded string.

Creating index and mappings

The next step is to create an index and mappings for it. For working with Elasticsearch we will be using NEST, which provides a strongly typed query DSL (Domain Specific Language).

var defaultIndex = "attachments";

var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var settings = new ConnectionSettings(pool)
    .DefaultIndex(defaultIndex);
var client = new ElasticClient(settings);

var createIndexResponse = client.Indices.Create(defaultIndex, c => c
    .Settings(s => s       
        .Analysis(a => a
            .Analyzers(ad => ad
                .Custom("windows_path_hierarchy_analyzer", ca => ca
                    .Tokenizer("windows_path_hierarchy_tokenizer")
                )
            )
            .Tokenizers(t => t
                .PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
                    .Delimiter('\\')
                )
            )
        )
    )
    .Map<Document>(mp => mp
        .AutoMap()
        .Properties(ps => ps
            .Text(s => s
                .Name(n => n.Path)
                .Analyzer("windows_path_hierarchy_analyzer")
            )
            .Text(s => s
                .Name(n => n.Acl)
                .Analyzer("windows_path_hierarchy_analyzer")
            )
            .Object<Attachment>(a => a
                .Name(n => n.Attachment)
                .AutoMap()
            )
        )
    )
);

The path hierarchy tokenizer is used on the Path and Acl properties to provide search across path hierarchies. Because this example runs on Windows, the delimiter will be the \ character.
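To inspect what the custom analyzer actually produces, the analyze API can be used against the index. A quick sketch, assuming the client and defaultIndex from above; the tokenizer emits one token per level of the path, which is what allows a query on a parent folder to match files beneath it.

var analyzeResponse = client.Indices.Analyze(a => a
    .Index(defaultIndex)
    .Analyzer("windows_path_hierarchy_analyzer")
    .Text(@"\\192.168.0.29\shared\files")
);

// Print each emitted token, one per hierarchy level.
foreach (var token in analyzeResponse.Tokens)
    Console.WriteLine(token.Token);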

Ingest pipeline

To take advantage of the ingest attachment processor plugin for extracting file content, we need to create an ingest pipeline. Here we take the base64-encoded value of the content field and direct the extracted result to the attachment field. We then remove the temporary content field, as we don't need the binary data in the index, only the extracted text content. Note: changes made using this API take effect immediately, so it's possible to update a pipeline at runtime.

var putPipelineResponse = client.Ingest.PutPipeline("attachments", p => p
    .Description("Document attachment pipeline")
    .Processors(pd => pd
        .Attachment<Document>(a => a
            .Field(doc => doc.Content)
            .TargetField(doc => doc.Attachment)
        )
        .Remove<Document>(r => r
            .Field(fd => fd
                .Field(doc => doc.Content)
            )
        )
    )
);

Indexing files

The first step is to find the files on the NAS. Here we use DirectoryInfo.EnumerateFiles with the AllDirectories search option to include the root directory and all its subdirectories. The collection of files is also filtered by a specified array of file extensions. For each file we extract the access rules and read the file bytes into the temporary content property.

// Requires: using System.IO; using System.Linq;
//           using System.Security.AccessControl; using System.Security.Principal;
// Note: on .NET Core, FileInfo.GetAccessControl is an extension method from the
// System.IO.FileSystem.AccessControl package and is only supported on Windows.
static IList<Document> FindFiles(string rootPath, string[] extensions)
{
    var files = new List<Document>();

    foreach (var fileInfo in new DirectoryInfo(rootPath)
        .EnumerateFiles("*.*", SearchOption.AllDirectories)
        .Where(fi => extensions.Any(ext => ext == fi.Extension.ToLower())))
    {
        var file = new Document
        {
            Path = fileInfo.FullName,
            Content = Convert.ToBase64String(File.ReadAllBytes(fileInfo.FullName))
        };

        var fileSecurity = fileInfo.GetAccessControl();
        var authAccessRules = fileSecurity.GetAccessRules(
            includeExplicit: true,
            includeInherited: true,
            targetType: typeof(NTAccount));

        foreach (FileSystemAccessRule fsaRule in authAccessRules)
        {
            if ((FileSystemRights.Read & fsaRule.FileSystemRights) != FileSystemRights.Read)
                continue;

            if (fsaRule.AccessControlType == AccessControlType.Allow)
                file.Acl.Add(fsaRule.IdentityReference.Value);
        }

        files.Add(file);
    }

    return files;
}

The next step is to index the files using the attachment pipeline created above. To do so, we specify the pipeline on the index request so that each document passes through it.

var files = FindFiles(@"\\192.168.0.29\shared\files", new[] { ".pdf" });
foreach (var file in files)
{
    client.Index(file, i => i
        .Pipeline("attachments")
        .Refresh(Refresh.WaitFor));
}
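Indexing one document at a time with Refresh.WaitFor is fine for a handful of files, but for larger sets a single bulk request through the same pipeline is usually faster. A sketch, assuming the client and files from above:

// Send the whole batch in one bulk request through the attachments pipeline.
var bulkResponse = client.Bulk(b => b
    .Pipeline("attachments")
    .IndexMany(files)
    .Refresh(Refresh.WaitFor)
);

// Bulk requests can partially fail, so check the per-item errors.
if (bulkResponse.Errors)
{
    foreach (var itemWithError in bulkResponse.ItemsWithErrors)
        Console.WriteLine($"Failed to index {itemWithError.Id}: {itemWithError.Error}");
}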

Searching

To search the content of the attachments while respecting the ACL, we will use a boolean query. A must clause is used for the search term, and the ACL part goes into a filter clause. Filter clauses are executed in filter context, meaning they don't affect the scoring of the matching documents – which is exactly what we want here.

var term = "azure";
var acl = new [] { "MyComputer\\TestGroup", "BUILTIN\\IIS_IUSRS" };

var searchResponse = client.Search<Document>(s => s
    .Query(q => q
        .Bool(b => b
            .Must(m => m
                .Term(t => t.Attachment.Content, term)
            )
            .Filter(f => f
                .Terms(t => t
                    .Field(d => d.Acl)
                    .Terms(acl)
                )
            )
        )
    )
);
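On the client side, the matching documents can be read from the strongly typed response, for example:

// Iterate the hits; Source is the strongly typed Document.
foreach (var hit in searchResponse.Hits)
    Console.WriteLine($"{hit.Score}  {hit.Source.Path}");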

The raw search result returned for the query looks as follows:

{
    "took": 7,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 1,
        "relation": "eq"
      },
      "max_score": 1.5210183,
      "hits": [
        {
          "_index": "attachments",
          "_type": "_doc",
          "_id": "7a455f92-bb66-4f02-a4d8-47ae0a07a2f2",
          "_score": 1.5210183,
          "_source": {
"path": "\\\\192.168.0.29\\shared\\files\\The Developer’s Guide to Azure.pdf",
            "attachment": {
              "date": "2020-07-21T08:57:48Z",
              "content_type": "application/pdf",
              "language": "en",
              "title": "The Developer’s Guide to Azure",
              "content": "E-book Series\nE-book Series\n\nThe Developer’s \nGuide to Azure [...]",
              "content_length": 100000
            },
            "id": "7a455f92-bb66-4f02-a4d8-47ae0a07a2f2",
            "acl": [
              "MyComputer\\TestGroup",
              "BUILTIN\\Administrators",
              "NT AUTHORITY\\SYSTEM",
              "BUILTIN\\Users",
              "NT AUTHORITY\\Authenticated Users"
            ]
          }
        }
      ]
    }
}

The next thing to do would be to wrap the indexing of documents in a scheduler. Maybe a Windows service or a Hangfire job?
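With Hangfire, a recurring job could look something like this. A rough sketch only – FileIndexingJob is a hypothetical class wrapping the FindFiles and indexing code above:

// Schedule a recurring indexing run every hour.
// FileIndexingJob and its Run method are assumed, not part of the code above.
RecurringJob.AddOrUpdate<FileIndexingJob>(
    "index-nas-files",
    job => job.Run(),
    Cron.Hourly());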

Categories: Development