Cognitive Services: Smart image tagging with AI (part 1)

The challenge sounds simple: How to generate a collection of keywords describing an image?

By description I don't mean metadata such as width or MIME type, but the actual content of the image. Of course, asking your mom to spend her afternoon doing that manually is not considered good practice, especially when there are thousands of images and you're still hoping for presents next Christmas.

Seriously though, this task sounds ideal for AI/ML, which have become very hot topics and are being applied in a rapidly growing number of areas. I partially covered this topic in my previous post, which gives a good example of using Microsoft Cognitive Services. I encourage you to read it if you'd like to learn more about using them with Sitecore and SXA.

To get to the point and answer the opening question: let's use the Tag Image endpoint to get a collection of relevant tags. To keep the tagging solution flexible, let's prepare two implementations: one accepting the image as a byte array, the other accepting a URL to an image available online:

using Sitecore91.Foundation.CognitiveServices.Models;

namespace Sitecore91.Foundation.CognitiveServices.ComputerVision
{
    public interface IComputerVisionService
    {
        TagModel Tag(byte[] image, string language);
        TagModel Tag(string imageUrl, string language);
    }
}

The TagModel class is a C# representation of the API JSON result:

using System.Collections.Generic;

namespace Sitecore91.Foundation.CognitiveServices.Models
{
    public class TagModel
    {
        public List<Tag> tags { get; set; }
        public string requestId { get; set; }
    }

    public class Tag
    {
        public string name { get; set; }
        public double confidence { get; set; }
    }
}

TIP: This model is fairly simple, but for more complex JSON structures you can generate the classes by pasting a sample success JSON result from the Image Tag docs into json2csharp. Just take a look at other endpoints like Analyze Image, where the JSON result maps to a structure of almost 20 C# classes.
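To see how the JSON maps onto the model, here is a minimal, self-contained sketch. It makes a few assumptions that are not from the service code in this post: it uses System.Text.Json instead of Json.NET so that it runs without extra packages, the TagModelDemo class is a hypothetical helper, and the payload is a hand-written illustration rather than a real API response (the requestId is a placeholder).

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Same shape as the model above, repeated so the sketch compiles on its own.
public class Tag
{
    public string name { get; set; }
    public double confidence { get; set; }
}

public class TagModel
{
    public List<Tag> tags { get; set; }
    public string requestId { get; set; }
}

// Hypothetical helper, for illustration only.
public static class TagModelDemo
{
    // Hand-written sample payload; the requestId is a placeholder, not a real value.
    public const string SampleJson =
        "{\"tags\":[{\"name\":\"sky\",\"confidence\":0.99}," +
        "{\"name\":\"truck\",\"confidence\":0.95}]," +
        "\"requestId\":\"sample-request-id\"}";

    // Property names match the JSON keys exactly (lower case), so no
    // naming options or attributes are needed for the mapping.
    public static TagModel Parse(string json)
    {
        return JsonSerializer.Deserialize<TagModel>(json);
    }
}
```

Calling TagModelDemo.Parse(TagModelDemo.SampleJson) should give a model with two tags. This is also why the property names in the model are lower case: they mirror the JSON keys one to one.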

Now, here's the service:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using Newtonsoft.Json;
using Sitecore.Diagnostics;
using Sitecore91.Foundation.CognitiveServices.Models;

namespace Sitecore91.Foundation.CognitiveServices.ComputerVision
{
    public class ComputerVisionService : IComputerVisionService
    {
        private readonly string _subscriptionKey = "<-SUBSCRIPTION KEY->";
        private readonly string _serviceUrl = "https://<-AZURE REGION->.api.cognitive.microsoft.com/vision/v2.0/";
 
        public TagModel Tag(byte[] image, string language)
        {
            var apiMethod = "tag";
            var requestUri = _serviceUrl + apiMethod + $"?language={language}";

            try
            {
                using (var client = new HttpClient())
                {
                    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

                    using (var content = new ByteArrayContent(image))
                    {
                        content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
                        using (var response = client.PostAsync(requestUri, content).GetAwaiter().GetResult())
                        {
                            if (response.IsSuccessStatusCode)
                            {
                                var result = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                                return JsonConvert.DeserializeObject<TagModel>(result);
                            }
                            var errorMessage = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                            Log.Error(errorMessage, this);
                            return null;
                        }
                    }
                }
            }
            catch (Exception e)
            {
                // Pass the exception itself so the stack trace ends up in the log.
                Log.Error(e.Message, e, this);
                return null;
            }
        }

        public TagModel Tag(string imageUrl, string language)
        {
            var apiMethod = "tag";
            var requestUri = _serviceUrl + apiMethod + $"?language={language}";

            try
            {
                using (var client = new HttpClient())
                {
                    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

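                    // Serialize the body so the key is quoted and the URL is escaped (valid JSON).
                    using (var content = new StringContent(JsonConvert.SerializeObject(new { url = imageUrl })))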
                    {
                        content.Headers.ContentType = new MediaTypeHeaderValue("application/json");
                        using (var response = client.PostAsync(requestUri, content).GetAwaiter().GetResult())
                        {
                            if (response.IsSuccessStatusCode)
                            {
                                var result = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                                return JsonConvert.DeserializeObject<TagModel>(result);
                            }
                            var errorMessage = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                            Log.Error(errorMessage, this);
                            return null;
                        }
                    }
                }
            }
            catch (Exception e)
            {
                // Pass the exception itself so the stack trace ends up in the log.
                Log.Error(e.Message, e, this);
                return null;
            }
        }
    }
}

If you read my previous post mentioned earlier, you'll find this code almost identical. Well, in fact all we do is make another API call and map the JSON result afterwards. To test that it works, I prepared a sample controller action registered with custom MVC routing:

public ActionResult TestTagging()
{
    var computerVisionService = new ComputerVisionService();
    var imageUrl = "https://images.pexels.com/photos/849835/pexels-photo-849835.jpeg";
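    var model = computerVisionService.Tag(imageUrl, "en");

    // Tag returns null when the call fails, so guard before reading the tags.
    if (model?.tags == null)
    {
        return new ContentResult { Content = "No tags returned; check the Sitecore log for details." };
    }

    return new ContentResult { Content = string.Join("<br />", model.tags.Select(x => $"{x.name}:{x.confidence}")) };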
}

 

Now, for the image below:

We get the following collection of tags returned by the service:

sky:0.996140539646149
truck:0.954374551773071
car:0.951652050018311
outdoor:0.936659157276154
snow:0.905518352985382
blue:0.871806204319
transport:0.659421741962433
winter:0.175758534148281

The floating-point value in the 0.0 to 1.0 range next to each tag is its 'confidence', representing how 'sure' the AI is that the tag is relevant. It's very useful for accurate tagging: choosing the right threshold lets you tune tagging relevance to your own needs.
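As a minimal sketch of that idea, the helper below keeps only the tags at or above a chosen threshold. TagFilter and SelectRelevant are hypothetical names introduced for illustration and are not part of the service; the Tag class is repeated so the snippet stands on its own.

```csharp
using System.Collections.Generic;
using System.Linq;

// Same shape as the Tag class from the model, repeated so the sketch compiles on its own.
public class Tag
{
    public string name { get; set; }
    public double confidence { get; set; }
}

// Hypothetical helper, for illustration only.
public static class TagFilter
{
    // Returns the names of tags whose confidence is at least minConfidence,
    // ordered from most to least confident. A null input gives an empty list.
    public static List<string> SelectRelevant(IEnumerable<Tag> tags, double minConfidence)
    {
        if (tags == null)
        {
            return new List<string>();
        }

        return tags
            .Where(t => t != null && t.confidence >= minConfidence)
            .OrderByDescending(t => t.confidence)
            .Select(t => t.name)
            .ToList();
    }
}
```

With the results listed above, a threshold of 0.9 would keep sky, truck, car, outdoor and snow, while 0.5 would also let blue and transport through and still drop winter. Where to put the bar depends on how noisy you can afford your tags to be.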

Cognitive Services + SXA: Custom Rendering Variant field

In my previous post we used MS Cognitive Services to crop images smartly on the fly. The goal was to reuse a single high-quality image across different display sizes (pages / devices), focused on the image's centre of interest, rather than creating and maintaining a manually cropped collection.

Now that we have the service implementation in place, let's integrate it with SXA. Given the purpose described above, a good usage example is a new Rendering Variant field that processes the referenced image according to the provided size and cropping mode.

 

1. To start, create a Rendering Variant field template and tailor it to the service by adding the minimal set of fields needed to use the previously developed service (in the code below: FieldName, Width, Height and IsSmartCrop):

 

2. Then, add this newly defined field to one of the Rendering Variants containing an image to crop:

 

Now some code backing up the functionality of created items:

3. In general, what we need to do is extend the parseVariantFields and renderVariantField pipelines:

<?xml version="1.0" encoding="utf-8" ?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <parseVariantFields>
        <processor type="Sitecore91.Foundation.SXAExtensions.Pipelines.VariantFields.SmartThumbnail.ParseSmartThumbnail, Sitecore91.Foundation.SXAExtensions" resolve="true"/>
      </parseVariantFields>
      <renderVariantField>
        <processor type="Sitecore91.Foundation.SXAExtensions.Pipelines.VariantFields.SmartThumbnail.RenderSmartThumbnail, Sitecore91.Foundation.SXAExtensions" resolve="true"/>
      </renderVariantField>
    </pipelines>
  </sitecore>
</configuration>

 

4. Then implement a new Rendering Variant field to be processed by those pipelines:

using Sitecore.Data.Items;
using Sitecore.XA.Foundation.RenderingVariants.Fields;

namespace Sitecore91.Foundation.SXAExtensions.Pipelines.VariantFields.SmartThumbnail
{
    public class SmartThumbnailVariant : RenderingVariantFieldBase
    {
        public int Width { get; set; }
        public int Height { get; set; }
        public bool IsSmartCrop { get; set; }

        public SmartThumbnailVariant(Item variantItem) : base(variantItem) { }
    }
}

 

5. Followed by the processor responsible for parsing the Rendering Variant field:

using Sitecore.Data;
using Sitecore.XA.Foundation.Variants.Abstractions.Pipelines.ParseVariantFields;

namespace Sitecore91.Foundation.SXAExtensions.Pipelines.VariantFields.SmartThumbnail
{
    public class ParseSmartThumbnail : ParseVariantFieldProcessor
    {
        public override ID SupportedTemplateId => Constants.RenderingVariants.SmartThumbnail.Fields.SmartThumbnailVariant;

        public override void TranslateField(ParseVariantFieldArgs args)
        {
            var variantFieldsArgs = args;

            var smartThumbnail = new SmartThumbnailVariant(args.VariantItem)
            {
                ItemName = args.VariantItem.Name,
                FieldName = args.VariantItem.Fields[Constants.RenderingVariants.SmartThumbnail.Fields.FieldName].GetValue(true),
                Width = int.TryParse(args.VariantItem.Fields[Constants.RenderingVariants.SmartThumbnail.Fields.Width].GetValue(true), out var width) ? width : 0,
                Height = int.TryParse(args.VariantItem.Fields[Constants.RenderingVariants.SmartThumbnail.Fields.Height].GetValue(true), out var height) ? height : 0,
                IsSmartCrop = int.TryParse(args.VariantItem.Fields[Constants.RenderingVariants.SmartThumbnail.Fields.IsSmartCrop].GetValue(true), out var smartCropValue) && smartCropValue == 1
            };
            variantFieldsArgs.TranslatedField = smartThumbnail;
        }
    }
}

Where Constants is a standard helper class with static IDs:

using Sitecore.Data;

namespace Sitecore91.Foundation.SXAExtensions
{
    public struct Constants
    {
        public struct RenderingVariants
        {
            public struct SmartThumbnail
            {
                public struct Fields
                {
                    public static ID SmartThumbnailVariant => new ID("{00458333-70A1-4D52-B375-62CBE13575CD}");
                    public static ID FieldName => new ID("{0B00BC72-0C1C-4A49-8C94-297E38E511E7}");
                    public static ID Width => new ID("{51A093C8-A516-4A70-8223-7B907DCB0958}");
                    public static ID Height => new ID("{6447EF7A-D84E-4CA9-9F57-7A7AC6FE87D6}");
                    public static ID IsSmartCrop => new ID("{4C4899C4-CA85-451A-9FEE-ECCD678D01FD}");
                }
            }
        }
    }
}
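The parsing rules used in ParseSmartThumbnail (numbers fall back to 0, and a checkbox counts as ticked only when its raw value is 1) can be pulled out into small helpers, which makes them easy to test without a Sitecore instance. This is only a sketch: VariantFieldValues, ToInt and ToBool are hypothetical names introduced here, not part of SXA.

```csharp
// Hypothetical helper, for illustration only; it mirrors the inline
// int.TryParse expressions used in ParseSmartThumbnail.
public static class VariantFieldValues
{
    // Returns the parsed integer, or 0 when the raw value is null, empty or not a number.
    public static int ToInt(string raw)
    {
        return int.TryParse(raw, out var value) ? value : 0;
    }

    // A ticked Sitecore Checkbox field has the raw value "1";
    // anything else (including an empty value) is treated as false.
    public static bool ToBool(string raw)
    {
        return int.TryParse(raw, out var value) && value == 1;
    }
}
```

With these in place, the Width assignment in the processor could read VariantFieldValues.ToInt(args.VariantItem.Fields[...].GetValue(true)), which keeps the object initializer shorter. Note that a width or height of 0 then means "not configured", so the render step may want to skip the service call in that case.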

 

6. Finally, the render processor for the Rendering Variant field, where the actual service call and the preparation of the rendered HTML take place:

using System;
using System.Drawing;
using System.IO;
using System.Web.UI.HtmlControls;
using Microsoft.Extensions.DependencyInjection;
using Sitecore.Data.Fields;
using Sitecore.Data.Items;
using Sitecore.DependencyInjection;
using Sitecore.Resources.Media;
using Sitecore.SecurityModel;
using Sitecore.XA.Foundation.RenderingVariants.Pipelines.RenderVariantField;
using Sitecore.XA.Foundation.Variants.Abstractions.Models;
using Sitecore.XA.Foundation.Variants.Abstractions.Pipelines.RenderVariantField;
using Sitecore91.Foundation.CognitiveServices.ComputerVision;

namespace Sitecore91.Foundation.SXAExtensions.Pipelines.VariantFields.SmartThumbnail
{
    public class RenderSmartThumbnail : RenderRenderingVariantFieldProcessor
    {
        private readonly IComputerVisionService _computerVisionService;
        public override Type SupportedType => typeof(SmartThumbnailVariant);
        public override RendererMode RendererMode => RendererMode.Html;

        public RenderSmartThumbnail()
        {
            _computerVisionService = ServiceLocator.ServiceProvider.GetService<IComputerVisionService>();
        }

        public override void RenderField(RenderVariantFieldArgs args)
        {
            var variantField = args.VariantField as SmartThumbnailVariant;
            var imageUrl = default(string);

            if (args.Item != null && !string.IsNullOrWhiteSpace(variantField?.FieldName))
            {
                ImageField imageField = args.Item.Fields[variantField.FieldName];
                if (imageField?.MediaItem != null)
                {
                    var thumbNameSuffix = "_thumb";
                    var newItemPath = imageField.MediaItem.Paths.Path + thumbNameSuffix;
                    var mediaItem = (MediaItem) imageField.MediaItem;

                    var thumbMediaItem = mediaItem.Database.GetItem(newItemPath);
                    if (thumbMediaItem != null)
                    {
                        using (new SecurityDisabler())
                        {
                            thumbMediaItem.Delete();
                        }
                    }

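                    // Dispose the media stream and the GDI+ image once the bytes are extracted.
                    byte[] image;
                    using (var mediaStream = mediaItem.GetMediaStream())
                    using (var sourceImage = Image.FromStream(mediaStream))
                    {
                        image = (byte[])new ImageConverter().ConvertTo(sourceImage, typeof(byte[]));
                    }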
                    var thumbnail = _computerVisionService.GetThumbnail(image, variantField.Width, variantField.Height, variantField.IsSmartCrop);

                    if (thumbnail == null)
                    {
                        return;
                    }

                    using (var memoryStream = new MemoryStream(thumbnail))
                    {
                        var mediaCreator = new MediaCreator();
                        var options = new MediaCreatorOptions
                        {
                            Versioned = false,
                            IncludeExtensionInItemName = false,
                            Database = mediaItem.Database,
                            Destination = newItemPath,
                            FileBased = false
                        };

                        using (new SecurityDisabler())
                        {
                            var newFileName = mediaItem.Name + thumbNameSuffix + "." + mediaItem.Extension;
                            thumbMediaItem = mediaCreator.CreateFromStream(memoryStream, newFileName, options);
                        }
                    }
                    imageUrl = MediaManager.GetMediaUrl(thumbMediaItem);
                }
            }

            if (string.IsNullOrWhiteSpace(imageUrl))
            {
                return;
            }

            var control = new HtmlGenericControl("img");
            control.Attributes.Add("src", imageUrl);

            args.ResultControl = control;
            args.Result = RenderControl(args.ResultControl);
        }
    }
}

 

If all has gone well, we should see the image, cropped smartly to the desired size around its centre of interest, as part of our Rendering Variant:

 

The example above is just an outline use case that generates simple HTML to display the image. Naturally, it can be extended to offer more flexibility, e.g. as the 'Responsive Image' field does.