Cognitive Services + Sitecore: Smart image tagging with AI (part 2)

Sitecore 9.1 release has brought many useful features including Cortex - widely advertised integration and support for AI and ML. For now it covers 2 features: component personalisation suggestions and content tagging. The latter comes with an OOTB integration with OpenCalais API. This service provided by Thomson Reuters provides categorisation suggestions based on the meaning of provided content. Combined with Sitecore, it automatically assigns tags based on item's text fields content.

Unfortunately, it's fairly simple and does not cover tagging based on the image content. However, with the help of Microsoft Cognitive Services (Tag Image endpoint to be precise) we can make it work and get a collection of relevant tags.

Let's get to work. To start we need to figure out OOTB tagging architecture and how it works, so a quick look at Sitecore docs gives us a brief idea of what we need. What we'll do is extending the current content tagging pipeline, so that processing a media item will not only use OpenCalais (which must be active) with item's text fields to produce the tags, but also use the MS Cognitive Services to analyze the image and describe it with appropriate tags.

Precisely, what we need are 2 new providers:

  • Content provider extracting the string of relevant information out of the item for further analysis
  • Discovery provider containing the business logic behind processing extracted data and producing tags

Ideally, there should be also an initial validation processor encapsulating all the checks and aborting the pipeline in case of any issues (missing configuration etc.) or if the item is not a good fit for our custom tagging. In our case it's just a functional extension, so performing such an abortion would stop the whole Content Tagging pipeline, not just our additional functionality resulting in no tags assigned at all. Because of that and to simplify the code, I decided to move those checks to Content and Discovery providers. What's more, not all 4 of providers available to be put in the pipeline have to be implemented. That means that Content and Discovery ones will add extra image tags returned by the API for a further, standard processing performed by the latter 2 providers. The configuration of such extension looks as follows:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:role="http://www.sitecore.net/xmlconfig/role/">
  <sitecore>
    <contentTagging>
      <configurations>
        <config name="Default">
          <content>
            <provider name="ComputerVisionContentProvider"/>
          </content>
          <discovery>
            <provider name="ComputerVisionDiscoveryProvider"/>
          </discovery>
        </config>
      </configurations>
      <providers>
        <content>
          <add name="ComputerVisionContentProvider" type="Sitecore91.Foundation.Tagging.Providers.ComputerVisionContentProvider, Sitecore91.Foundation.Tagging" />
        </content>
        <discovery>
          <add name="ComputerVisionDiscoveryProvider" type="Sitecore91.Foundation.Tagging.Providers.ComputerVisionDiscoveryProvider, Sitecore91.Foundation.Tagging" />
        </discovery>
      </providers>
    </contentTagging>
  </sitecore>
</configuration>

Now, here's the Content provider extracting the taggable data from the item. Standard process is extraction of the item text fields and sending them down the pipeline for conversion into tags. In our case to tightly follow that purpose we'd have to serialize the image into byte array. Let's go for a different approach, extract just the image ID and pass it further so that we'll delegate and encapsulate this task within Discovery provider. The only check we make is if it's a media item which does not eliminate, but strongly reduce the number of unsupported items left for further processing.

using Sitecore.ContentTagging.Core.Models;
using Sitecore.ContentTagging.Core.Providers;
using Sitecore.Data.Items;

namespace Sitecore91.Foundation.Tagging.Providers
{
    public class ComputerVisionContentProvider : IContentProvider<Item>
    {
        public TaggableContent GetContent(Item item)
        {
            var stringContent = new StringContent();
            if (item.Paths.IsMediaItem)
            {
                stringContent.Content = item.ID.ToString();
            }
            return stringContent;
        }
    }
}

To align the business logic with the approach taken in Content provider, we need think on how to handle the 'special' case of processing the image item. Just to remind: the providers described here are just extending the existing pipeline, so we focus on handling this case only exclusively by processing the taggable content only if:

  • all it contains is an item ID (it's been processed by our custom Content provider)
  • the item is a media item (in this use case we're tagging images from media library only)
  • the media item is an image (as only images will get correctly tagged by the service)

After that we just serialize such image and pass it to the service. Please keep in mind that we also have the other 'Tag' implementation accepting image URL which would make the API calls much faster. I just used the other overload as found it easier to use in development environment.

I also made use of the 'confidence' parameter to filter out the tags with less than 50% of confidence to filter out the potentially irrelevant tags.

using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using Microsoft.Extensions.DependencyInjection;
using Sitecore.ContentTagging.Core.Models;
using Sitecore.ContentTagging.Core.Providers;
using Sitecore.Data;
using Sitecore.Data.Items;
using Sitecore.DependencyInjection;
using Sitecore91.Foundation.CognitiveServices.ComputerVision;

namespace Sitecore91.Foundation.Tagging.Providers
{
    public class ComputerVisionDiscoveryProvider : IDiscoveryProvider
    {
        public IEnumerable<TagData> GetTags(IEnumerable<TaggableContent> content)
        {
            var computerVisionService = ServiceLocator.ServiceProvider.GetService<IComputerVisionService>();

            var tagDataList = new List<TagData>();
            foreach (StringContent stringContent in content)
            {
                if (string.IsNullOrEmpty(stringContent?.Content) || !ID.TryParse(stringContent.Content, out var itemId))
                    continue;

                var item = Sitecore.Configuration.Factory.GetDatabase("master").GetItem(itemId);
                if (!item.Paths.IsMediaItem)
                    continue;

                var mediaItem = new MediaItem(item);
                if (!mediaItem.MimeType.StartsWith("image/"))
                    continue;

                var image = (byte[]) new ImageConverter().ConvertTo(Image.FromStream(mediaItem.GetMediaStream()), typeof(byte[]));
                var result = computerVisionService.Analyze(image, "tags", "", "en");
                var tags = result.tags.Where(x => x.confidence > 0.5).Select(x => x.name);

                tagDataList.AddRange(tags.Select(tag => new TagData {TagName = tag}));
            }
            return tagDataList;
        }

        public bool IsConfigured()
        {
            return true;
        }
    }
}

From implementation point of view that's it - 2 providers extending the pipeline. All you have to do now to test the solution is to select a media item in Content Editor, select the Home tab in the ribbon and click Tag Item in Content Tagging section. After several seconds of processing of our sample image:

You can refresh the 'Tagging' section to see the following tags:

That aligns with the collection generated in my previous post where we directly hit the service with this exact image. We have 1 tag missing comparing with the previous tag set and that's the result of filtering using the 'confidence' field.