'hold my beer' with Sitecore

Sitecore 9.1 release has brought many useful features including Cortex - widely advertised integration and support for AI and ML. For now it covers 2 features: component personalisation suggestions and content tagging. The latter comes with an OOTB integration with OpenCalais API. This service provided by Thomson Reuters provides categorisation suggestions based on the meaning of provided content. Combined with Sitecore, it automatically assigns tags based on item's text fields content.

Unfortunately, it's fairly simple and does not cover tagging based on the image content. However, with the help of Microsoft Cognitive Services (Tag Image endpoint to be precise) we can make it work and get a collection of relevant tags.

Let's get to work. To start we need to figure out OOTB tagging architecture and how it works, so a quick look at Sitecore docs gives us a brief idea of what we need. What we'll do is extending the current content tagging pipeline, so that processing a media item will not only use OpenCalais (which must be active) with item's text fields to produce the tags, but also use the MS Cognitive Services to analyze the image and describe it with appropriate tags.

Precisely, what we need are 2 new providers:

Content provider extracting the string of relevant information out of the item for further analysis
Discovery provider containing the business logic behind processing extracted data and producing tags

Ideally, there should be also an initial validation processor encapsulating all the checks and aborting the pipeline in case of any issues (missing configuration etc.) or if the item is not a good fit for our custom tagging. In our case it's just a functional extension, so performing such an abortion would stop the whole Content Tagging pipeline, not just our additional functionality resulting in no tags assigned at all. Because of that and to simplify the code, I decided to move those checks to Content and Discovery providers. What's more, not all 4 of providers available to be put in the pipeline have to be implemented. That means that Content and Discovery ones will add extra image tags returned by the API for a further, standard processing performed by the latter 2 providers. The configuration of such extension looks as follows:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:role="http://www.sitecore.net/xmlconfig/role/">
  <sitecore>
    <contentTagging>
      <configurations>
        <config name="Default">
          <content>
            <provider name="ComputerVisionContentProvider"/>
          </content>
          <discovery>
            <provider name="ComputerVisionDiscoveryProvider"/>
          </discovery>
        </config>
      </configurations>
      <providers>
        <content>
          <add name="ComputerVisionContentProvider" type="Sitecore91.Foundation.Tagging.Providers.ComputerVisionContentProvider, Sitecore91.Foundation.Tagging" />
        </content>
        <discovery>
          <add name="ComputerVisionDiscoveryProvider" type="Sitecore91.Foundation.Tagging.Providers.ComputerVisionDiscoveryProvider, Sitecore91.Foundation.Tagging" />
        </discovery>
      </providers>
    </contentTagging>
  </sitecore>
</configuration>

Now, here's the Content provider extracting the taggable data from the item. Standard process is extraction of the item text fields and sending them down the pipeline for conversion into tags. In our case to tightly follow that purpose we'd have to serialize the image into byte array. Let's go for a different approach, extract just the image ID and pass it further so that we'll delegate and encapsulate this task within Discovery provider. The only check we make is if it's a media item which does not eliminate, but strongly reduce the number of unsupported items left for further processing.

using Sitecore.ContentTagging.Core.Models;
using Sitecore.ContentTagging.Core.Providers;
using Sitecore.Data.Items;

namespace Sitecore91.Foundation.Tagging.Providers
{
    public class ComputerVisionContentProvider : IContentProvider<Item>
    {
        public TaggableContent GetContent(Item item)
        {
            var stringContent = new StringContent();
            if (item.Paths.IsMediaItem)
            {
                stringContent.Content = item.ID.ToString();
            }
            return stringContent;
        }
    }
}

To align the business logic with the approach taken in Content provider, we need think on how to handle the 'special' case of processing the image item. Just to remind: the providers described here are just extending the existing pipeline, so we focus on handling this case only exclusively by processing the taggable content only if:

all it contains is an item ID (it's been processed by our custom Content provider)
the item is a media item (in this use case we're tagging images from media library only)
the media item is an image (as only images will get correctly tagged by the service)

After that we just serialize such image and pass it to the service. Please keep in mind that we also have the other 'Tag' implementation accepting image URL which would make the API calls much faster. I just used the other overload as found it easier to use in development environment.

I also made use of the 'confidence' parameter to filter out the tags with less than 50% of confidence to filter out the potentially irrelevant tags.

using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using Microsoft.Extensions.DependencyInjection;
using Sitecore.ContentTagging.Core.Models;
using Sitecore.ContentTagging.Core.Providers;
using Sitecore.Data;
using Sitecore.Data.Items;
using Sitecore.DependencyInjection;
using Sitecore91.Foundation.CognitiveServices.ComputerVision;

namespace Sitecore91.Foundation.Tagging.Providers
{
    public class ComputerVisionDiscoveryProvider : IDiscoveryProvider
    {
        public IEnumerable<TagData> GetTags(IEnumerable<TaggableContent> content)
        {
            var computerVisionService = ServiceLocator.ServiceProvider.GetService<IComputerVisionService>();

            var tagDataList = new List<TagData>();
            foreach (StringContent stringContent in content)
            {
                if (string.IsNullOrEmpty(stringContent?.Content) || !ID.TryParse(stringContent.Content, out var itemId))
                    continue;

                var item = Sitecore.Configuration.Factory.GetDatabase("master").GetItem(itemId);
                if (!item.Paths.IsMediaItem)
                    continue;

                var mediaItem = new MediaItem(item);
                if (!mediaItem.MimeType.StartsWith("image/"))
                    continue;

                var image = (byte[]) new ImageConverter().ConvertTo(Image.FromStream(mediaItem.GetMediaStream()), typeof(byte[]));
                var result = computerVisionService.Analyze(image, "tags", "", "en");
                var tags = result.tags.Where(x => x.confidence > 0.5).Select(x => x.name);

                tagDataList.AddRange(tags.Select(tag => new TagData {TagName = tag}));
            }
            return tagDataList;
        }

        public bool IsConfigured()
        {
            return true;
        }
    }
}

From implementation point of view that's it - 2 providers extending the pipeline. All you have to do now to test the solution is to select a media item in Content Editor, select the Home tab in the ribbon and click Tag Item in Content Tagging section. After several seconds of processing of our sample image:

You can refresh the 'Tagging' section to see the following tags:

That aligns with the collection generated in my previous post where we directly hit the service with this exact image. We have 1 tag missing comparing with the previous tag set and that's the result of filtering using the 'confidence' field.

The challenge sounds simple: How to generate a collection of keywords describing an image?

By description I don't mean its metadata like width or MIME type, but rather the actual content of it. Of course, solutions like asking your mom to spend her afternoon doing that manually for you are not considered good practice - especially when talking about thousands of images and you hoping for any presents next Christmas.

Talking seriously, this task sounds ideal for usage of AI/ML which have become very hot topics these days and are trending to be applicable in a rapidly growing number of areas. I have already partially covered this topic providing a good example of Microsoft Cognitive Services utilisation in my previous post. I encourage you to read it if you'd like to broaden your knowledge around usage of it with Sitecore and SXA.

To get to the point and answer the question asked in the first caption: let's make use of Tag Image endpoint to get a collection of relevant tags. In order to have some flexibility with applying the tagging solution let's prepare 2 implementations: one accepting a serialized image while the other a URL to an image available online:

using Sitecore91.Foundation.CognitiveServices.Models;

namespace Sitecore91.Foundation.CognitiveServices.ComputerVision
{
    public interface IComputerVisionService
    {
        TagModel Tag(byte[] image, string language);
        TagModel Tag(string imageUrl, string language);
    }
}

The TagModel class is a C# representation of the API JSON result:

using System.Collections.Generic;

namespace Sitecore91.Foundation.CognitiveServices.Models
{
    public class TagModel
    {
        public List<Tag> tags { get; set; }
        public string requestId { get; set; }
    }

    public class Tag
    {
        public string name { get; set; }
        public double confidence { get; set; }
    }
}

TIP: This model is fairy simple, but for more complex JSON structures you can simply generate it yourself by processing a sample success JSON result from Image Tag docs with json2sharp. Just take a look at different endpoints like Analyze Image where the JSON result maps to a structure of almost 20 C# classes.

Now, here's the service:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using Sitecore.Diagnostics;
 
namespace Sitecore91.Foundation.CognitiveServices.ComputerVision
{
    public class ComputerVisionService : IComputerVisionService
    {
        private readonly string _subscriptionKey = "<-SUBSCRIPTION KEY->";
        private readonly string _serviceUrl = "https://<-AZURE REGION->.api.cognitive.microsoft.com/vision/v2.0/";
 
        public TagModel Tag(byte[] image, string language)
        {
            var apiMethod = "tag";
            var requestUri = _serviceUrl + apiMethod + $"?language={language}";

            try
            {
                using (var client = new HttpClient())
                {
                    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

                    using (var content = new ByteArrayContent(image))
                    {
                        content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
                        using (var response = client.PostAsync(requestUri, content).GetAwaiter().GetResult())
                        {
                            if (response.IsSuccessStatusCode)
                            {
                                var result = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                                return JsonConvert.DeserializeObject<TagModel>(result);
                            }
                            var errorMessage = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                            Log.Error(errorMessage, this);
                            return null;
                        }
                    }
                }
            }
            catch (Exception e)
            {
                Log.Error(e.Message, this);
                return null;
            }
        }

        public TagModel Tag(string imageUrl, string language)
        {
            var apiMethod = "tag";
            var requestUri = _serviceUrl + apiMethod + $"?language={language}";

            try
            {
                using (var client = new HttpClient())
                {
                    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

                    using (var content = new StringContent($"{{url:\"{imageUrl}\"}}"))
                    {
                        content.Headers.ContentType = new MediaTypeHeaderValue("application/json");
                        using (var response = client.PostAsync(requestUri, content).GetAwaiter().GetResult())
                        {
                            if (response.IsSuccessStatusCode)
                            {
                                var result = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                                return JsonConvert.DeserializeObject<TagModel>(result);
                            }
                            var errorMessage = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                            Log.Error(errorMessage, this);
                            return null;
                        }
                    }
                }
            }
            catch (Exception e)
            {
                Log.Error(e.Message, this);
                return null;
            }
        }
    }
}

If you took a look at my previous post mentioned before, you'll find this piece of code looking almost identical. We'll, in fact all we do is just another API call with some JSON result mapping afterwards. To test if it works fine I prepared a sample controller action registered with custom MVC routing:

public ActionResult TestTagging()
{
    var computerVisionService = new ComputerVisionService();
    var imageUrl = "https://images.pexels.com/photos/849835/pexels-photo-849835.jpeg";
    var model = computerVisionService.Tag(imageUrl, "en");

    return new ContentResult { Content = string.Join("<br />", model.tags.Select(x => $"{x.name}:{x.confidence}")) };
}

Now, for the image below:

We get the following collection of tags returned by the service:

sky:0.996140539646149
truck:0.954374551773071
car:0.951652050018311
outdoor:0.936659157276154
snow:0.905518352985382
blue:0.871806204319
transport:0.659421741962433
winter:0.175758534148281

The float value from 0.0-1.0 range next to each tag is its 'confidence', representing how 'sure' the AI is about the relevance of each assigned tag. It's very useful for accurate tagging, as finding a correct threshold allows you to tune the tagging relevance for your own needs.

by Maciej Gontarz

Cognitive Services + Sitecore: Smart image tagging with AI (part 2)

Cognitive Services: Smart image tagging with AI (part 1)