5.1 KiB
K-Means Functions
k-means clustering is a popular technique for finding clusters in data by minimizing the intra-cluster 'distance' and maximizing the inter-cluster 'distance'. The distance is defined in the parameter space of the variables entered.
CDB_KMeans(subquery text, no_clusters INTEGER)
This function attempts to find no_clusters
clusters within the input data based on the geographic distribution. It will return a table with ids and the cluster classification of each point input assuming the_geom
is not null-valued. If the_geom
is null-valued, the point will not be considered in the analysis.
Arguments
Name | Type | Description |
---|---|---|
subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., SELECT * FROM interesting_table ). This query must have the geometry column name the_geom and id column name cartodb_id unless otherwise specified in the input arguments |
no_clusters | INTEGER | The number of clusters to find |
Returns
A table with the following columns.
Column Name | Type | Description |
---|---|---|
cartodb_id | INTEGER | The row id of the row from the input table |
cluster_no | INTEGER | The cluster that this point belongs to |
Example Usage
SELECT
customers.*,
km.cluster_no
FROM
cdb_crankshaft.CDB_KMeans('SELECT * from customers' , 6) As km,
customers
WHERE
customers.cartodb_id = km.cartodb_id
CDB_WeightedMean(subquery text, weight_column text, category_column text)
Function that computes the weighted centroid of a number of clusters by some weight column.
Arguments
Name | Type | Description |
---|---|---|
subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., SELECT * FROM interesting_table ). This query must have the geometry column and the columns specified as the weight and category columns |
weight_column | TEXT | The name of the column to use as a weight |
category_column | TEXT | The name of the column to use as a category |
Returns
A table with the following columns.
Column Name | Type | Description |
---|---|---|
the_geom | GEOMETRY | A point for the weighted cluster center |
class | INTEGER | The cluster class |
Example Usage
SELECT
ST_Transform(km.the_geom, 3857) As the_geom_webmercator,
km.class
FROM
cdb_crankshaft.CDB_WeightedMean(
'SELECT *, customer_value FROM customers',
'customer_value',
'cluster_no') As km
CDB_KMeansNonspatial(subquery text, colnames text[], no_clusters int)
K-means clustering classifies the rows of your dataset into no_clusters
by finding the centers (means) of the variables in colnames
and classifying each row by it's proximity to the nearest center. This method partitions space into distinct Voronoi cells.
As a standard machine learning method, k-means clustering is an unsupervised learning technique that finds the natural clustering of values. For instance, it is useful for finding subgroups in census data leading to demographic segmentation.
Arguments
Name | Type | Description |
---|---|---|
query | TEXT | SQL query to expose the data to be used in the analysis (e.g., SELECT * FROM iris_data ). It should contain at least the columns specified in colnames and the id_colname . |
colnames | TEXT[] | Array of columns to be used in the analysis (e.g., Array['petal_width', 'sepal_length', 'petal_length'] ). |
no_clusters | INTEGER | Number of clusters for the classification of the data |
id_col (optional) | TEXT | The id column (default: 'cartodb_id') for identifying rows |
standarize (optional) | BOOLEAN | Setting this to true (default) standardizes the data to have a mean at zero and a standard deviation of 1 |
Returns
A table with the following columns.
Column | Type | Description |
---|---|---|
cluster_label | TEXT | Label that a cluster belongs to, number from 0 to no_clusters - 1 . |
cluster_center | JSON | Center of the cluster that a row belongs to. The keys of the JSON object are the colnames , with values that are the center of the respective cluster |
silhouettes | NUMERIC | Silhouette score of the cluster label |
inertia | NUMERIC | Sum of squared distances of samples to their closest cluster center |
rowid | BIGINT | id of the original row for associating back with the original data |
Example Usage
SELECT
customers.*,
km.cluster_label,
km.cluster_center,
km.silhouettes
FROM
cdb_crankshaft.CDB_KMeansNonspatial(
'SELECT * FROM customers',
Array['customer_value', 'avg_amt_spent', 'home_median_income'],
7) As km,
customers
WHERE
customers.cartodb_id = km.rowid
Resources
- Read more in scikit-learn's documentation
- K-means basics