K-Means with OpenCL

This is a little tutorial for implementing k-means algorithm in opencl for creating image palette. OpenCL is used (if enabled) by photon for quick image scaling and creating a paletted image so it can be quickly transformed to sixel data.

K-means can be used to find k clusters of data, generally vectors. If we use the pixel colors of the image as a set of vectors, we can find k colors that represent the reduced palette of the image. Using GPGPU will speed things up exponentially.

We start with k centers of clusters that we initialize with random pixels in the image.

//pseudo random generator; seed passed to kernel + globalID
inline uint rand(uint seed) {
	seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
	uint result = seed >> 16;
	return result;

__kernel void random_centers_from_image(
		read_only image2d_t img,
		uint img_width,
		uint img_height,
		__global float4 *centers,
		uint seed
	) {
   const int i = get_global_id(0);
   const uint x = rand(seed+i*2) % img_width;
   const uint y = rand(seed+i*3) % img_height;
   const int2 pos = { x, y };
   const float4 value = read_imagef(img, sampler, pos);
   centers[i] = value;

The random_centers_from_image kernel initializes an array of float4 vectors, which represent the centers of the clusters, picking the pixels at random with a pseudo random generator plus a seed (which can be just time_t).

Now the k-means algorithm consists of tree steps:

  1. assign vectors to the nearest center
  2. recompute the center from the vectors assigned to it
  3. when no assignments where changed stop, else go to 1.

Assignments with opencl are great. Nothing beats the speed by which the GPU can compute the euclidean distance from all vectors to all centers.

__kernel void assign_to_centers(
		read_only image2d_t img,
		uint img_width,
		__global float4 *centers,
		uint k,
		__global uchar *assignments,
		__global int *changed
) {
   const int gx = get_global_id(0);
   const int gy = get_global_id(1);
   const int2 pos = { gx, gy };
   const float4 value = read_imagef(img, sampler, pos);
   const int ai = (img_width*gy)+gx;
   float dd = 1.0E30f;
   uchar indMin = -1;
   for (int i = 0; i < k; i++) {
       float localD = length(centers[i] - value);
	   if (localD < dd) {
		   indMin = i;
		   dd = localD;
   if (assignments[ai] != indMin) changed[0] = 1;
   assignments[ai] = indMin;

It takes the original img, centers, an array of indexes from pixels to the assigned centers (an img_width times img_height array of uchar) and a pointer that signalizes that the assignments changed. The length function is here the key for the speed, because it is build in to opencl and it calculates the distance between the center and the pixel. There are so many workers as there are pixels in the image. gx and gy are the index to the currently processed pixel.

Now when the data vectors (pixels) where assigned to the nearest centers, we must recompute the new centers by calculating the means of the assigned vectors.

__kernel void recompute_centers(
		read_only image2d_t img,
		uint img_width,
		uint img_height,
		__global float4 *centers,
		__global ulong *global_sum,
		__global uint *global_count,
		__global uchar *assignments,
		__local float4 *sum,
		__local uint *count
) {
   const int window = 8;
   const int gid = get_global_id(0);
   const int lid = get_local_id(0);
   const int local_size = get_local_size(0);
   sum[lid] = 0;
   count[lid] = 0;
   global_sum[lid] = 0;
   global_count[lid] = 0;
   const int item_offset = gid*window;
   for (int x = item_offset; x < (item_offset + window) && x < img_width; x++) {
      for (int y = item_offset; y < (item_offset + window) && y < img_height; y++) {
		 if (lid == assignments[img_width*y+x]) {
			 const int2 pos = {x,y};
			 const float4 value = read_imagef(img, sampler, pos);
			 sum[lid] += value;
   if (sum[lid].x != 0) atom_add(global_sum + lid*4, (int)(sum[lid].x*255));
   if (sum[lid].y != 0) atom_add(global_sum + lid*4+1, (int)(sum[lid].y*255));
   if (sum[lid].z != 0) atom_add(global_sum + lid*4+2, (int)(sum[lid].z*255));
   if (sum[lid].w != 0) atom_add(global_sum + lid*4+3, (int)(sum[lid].w*255));
   if (count[lid] != 0) atomic_add(global_count + lid, count[lid]);
   if (gid == 0) {
	   for (int i=0; i<local_size; i++) {
		   if (global_count[i] == 0) continue;
	       centers[i].x = global_sum[i*4] / global_count[i] / 255.0;
	       centers[i].y = global_sum[i*4+1] / global_count[i] / 255.0;
	       centers[i].z = global_sum[i*4+2] / global_count[i] / 255.0;
	       centers[i].w = global_sum[i*4+3] / global_count[i] / 255.0;

This is more complicated and there is a scary barrier, but bare with me. To compute the sum of all the assigned pixels in a parallel fashion, we can’t just add the value to a global sum, that would be too slow.

For this we use a 8 by 8 window of pixels and every worker sums up just the pixels that he is responsible for. We have a global size of (img_width * img_heigh +1) / WINDOW_SIZE and a local size of k. So for every window there are k workers that sum up his own pixels to the local memory.

When all workers have completed the processing of the window (there is the barrier) we atomically add the local results to the global sum and count.

Then the last step is just for the worker with gid 0 to compute the means of all the centers.

Every time we check the changed value (from the assign_to_centers) and when there where no changes, we quit.

The result is a palette of k colors and a matrix of pixels with indexes to the palette.