crankshaft/doc/11_kmeans.md
2017-01-10 10:43:42 -05:00

K-Means Functions

CDB_KMeans(subquery text, no_clusters INTEGER)

This function attempts to find no_clusters clusters within the input data based on the geographic distribution of the points. It returns a table with the id and cluster classification of every input point whose the_geom is not null. Rows with a null the_geom are excluded from the analysis.

Arguments

| Name | Type | Description |
|------|------|-------------|
| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., SELECT * FROM interesting_table). This query must have the geometry column name the_geom and id column name cartodb_id unless otherwise specified in the input arguments |
| no_clusters | INTEGER | The number of clusters to try to find |

Returns

A table with the following columns.

| Column Name | Type | Description |
|-------------|------|-------------|
| cartodb_id | INTEGER | The row id of the row from the input table |
| cluster_no | INTEGER | The cluster that this point belongs to |

Example Usage

SELECT
  customers.*,
  km.cluster_no
FROM
  cdb_crankshaft.CDB_KMeans('SELECT * FROM customers', 6) AS km,
  customers
WHERE customers.cartodb_id = km.cartodb_id

CDB_WeightedMean(subquery text, weight_column text, category_column text)

Computes the weighted centroid of each cluster, using the values in a weight column to weight each point.
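
To make the idea of a weighted centroid concrete, here is a minimal Python sketch. It assumes the centroid of a class is the weight-averaged x and y of its points, treated as plain planar coordinates. The function name `weighted_centroids` and the tuple layout are hypothetical and are not part of crankshaft; the actual SQL function may compute its result differently.

```python
from collections import defaultdict


def weighted_centroids(rows):
    """Return {category: (x, y)} with each centre weight-averaged.

    rows: iterable of (x, y, weight, category) tuples (hypothetical layout).
    """
    # Per category: [sum(weight * x), sum(weight * y), sum(weight)]
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for x, y, weight, category in rows:
        acc = sums[category]
        acc[0] += weight * x
        acc[1] += weight * y
        acc[2] += weight
    return {
        category: (wx / w, wy / w)
        for category, (wx, wy, w) in sums.items()
        if w != 0  # skip categories whose weights sum to zero
    }


customers = [
    (0.0, 0.0, 1.0, 0),
    (4.0, 0.0, 3.0, 0),
    (10.0, 10.0, 2.0, 1),
]
centroids = weighted_centroids(customers)
print(centroids)
```

In class 0 the point with weight 3 pulls the centre three quarters of the way towards itself, so the centre lands at x = 3 and not at the unweighted midpoint x = 2.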

Arguments

| Name | Type | Description |
|------|------|-------------|
| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., SELECT * FROM interesting_table). This query must have the geometry column and the columns specified as the weight and category columns |
| weight_column | TEXT | The name of the column to use as a weight |
| category_column | TEXT | The name of the column to use as a category |

Returns

A table with the following columns.

| Column Name | Type | Description |
|-------------|------|-------------|
| the_geom | GEOMETRY | A point for the weighted cluster center |
| class | INTEGER | The cluster class |

Example Usage

SELECT
  ST_Transform(the_geom, 3857) As the_geom_webmercator,
  class
FROM
  cdb_crankshaft.CDB_WeightedMean(
    'SELECT *, customer_value FROM customers',
    'customer_value',
    'cluster_no')

CDB_KMeansNonspatial(subquery text, colnames text[], no_clusters int)

K-means clustering classifies the rows of your dataset into no_clusters clusters by finding the centers (means) of the variables in colnames and classifying each row by its proximity to the nearest center. This method partitions the variable space into distinct Voronoi cells.

K-means clustering is a standard unsupervised machine learning technique that finds natural groupings of values. For instance, it is useful for finding subgroups in census data, which leads to demographic segmentation.
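
The assign-then-update loop behind k-means can be sketched in a few lines of Python. This is an illustrative version of Lloyd's algorithm, not the crankshaft implementation. The function name `kmeans` is hypothetical, and seeding the centers with the first k points is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, n_iter=100):
    """Minimal Lloyd's algorithm over tuples of floats.

    Assumption: the first k points seed the centres. Real implementations
    usually use a smarter or randomised initialisation.
    """
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centres are stable
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, 2)
print(labels)
print(centers)
```

Because every point is assigned to its nearest center, the boundary between two clusters is the set of positions equidistant from both centers, which is what produces the Voronoi cells.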

Arguments

| Name | Type | Description |
|------|------|-------------|
| query | TEXT | SQL query to expose the data to be used in the analysis (e.g., SELECT * FROM iris_data). It should contain at least the columns specified in colnames and the id_colname. |
| colnames | TEXT[] | Array of columns to be used in the analysis (e.g., Array['petal_width', 'sepal_length', 'petal_length']). |
| no_clusters | INTEGER | Number of clusters for the classification of the data |
| id_colname (optional) | TEXT | The id column (default: 'cartodb_id') for identifying rows |
| standarize (optional) | BOOLEAN | Setting this to true (default) standardizes the data to have a mean of zero and a standard deviation of 1 |
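
The following Python sketch shows what standardizing columns to mean 0 and standard deviation 1 looks like. It assumes the population standard deviation is used, which this document does not state, and the helper name `standardize` is hypothetical.

```python
import statistics


def standardize(columns):
    """Rescale each column to mean 0 and population standard deviation 1.

    columns: dict mapping a column name to a list of floats.
    """
    result = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        # A constant column has std 0; map it to zeros to avoid dividing by zero.
        result[name] = [(v - mean) / std if std else 0.0 for v in values]
    return result


data = {
    "petal_width": [1.0, 2.0, 3.0],
    "sepal_length": [10.0, 20.0, 30.0],
}
scaled = standardize(data)
print(scaled)
```

After scaling, both columns hold the same values even though sepal_length was ten times larger. Standardization matters for k-means for this reason: without it, the column with the largest numeric range dominates the distance calculation.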

Returns

A table with the following columns.

| Column | Type | Description |
|--------|------|-------------|
| cluster_label | TEXT | Label of the cluster that a row belongs to, a number from 0 to no_clusters - 1. |
| cluster_center | JSON | Center of the cluster that a row belongs to. The keys of the JSON object are the colnames, with values that are the center of the respective cluster |
| silhouettes | NUMERIC | Silhouette score of the cluster label |
| rowid | BIGINT | id of the original row for associating back with the original data |
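
For readers unfamiliar with silhouette scores, the sketch below computes the standard per-point silhouette, s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points of the nearest other cluster. Whether the silhouettes column holds exactly this per-row value is an assumption, and the function name `silhouette_samples` is hypothetical.

```python
import math


def silhouette_samples(points, labels):
    """Per-point silhouette scores, each in the range [-1, 1]."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Group the distances from p to every other point by cluster label.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        own = by_cluster.pop(lab, [])
        if not own or not by_cluster:
            # Singleton cluster, or only one cluster overall: score is 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
scores = silhouette_samples(points, [0, 0, 1, 1])
print(scores)
```

Values near 1 indicate a row that sits well inside its cluster, values near 0 indicate a row on the boundary between two clusters, and negative values suggest the row may fit a neighboring cluster better.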

Resources