
## K-Means Functions

K-means clustering is a popular technique for finding clusters in data by minimizing the intra-cluster 'distance' and maximizing the inter-cluster 'distance'. The distance is measured in the parameter space of the input variables.

### CDB_KMeans(subquery text, no_clusters integer)

This function attempts to find `no_clusters` clusters within the input data based on the geographic distribution. It returns a table with the id and cluster classification of each input point whose `the_geom` is not null. Points with a null `the_geom` are excluded from the analysis.

#### Arguments

| Name | Type | Description |
|------|------|-------------|
| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must have the geometry column name `the_geom` and id column name `cartodb_id` unless otherwise specified in the input arguments |
| no\_clusters | INTEGER | The number of clusters to find |

#### Returns

A table with the following columns.

| Column Name | Type | Description |
|-------------|------|-------------|
| cartodb\_id | INTEGER | The row id of the row from the input table |
| cluster\_no | INTEGER | The cluster that this point belongs to |


#### Example Usage

```sql
SELECT
    customers.*,
    km.cluster_no
FROM
    cdb_crankshaft.CDB_KMeans('SELECT * FROM customers', 6) As km,
    customers
WHERE
    customers.cartodb_id = km.cartodb_id
```
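
Once each point has a cluster label, a natural follow-up is to check how the points are distributed across the clusters. The query below is a minimal sketch that reuses the illustrative `customers` table from the example above and counts the points assigned to each cluster. Rows with a null `the_geom` do not appear in the counts, since they are excluded from the analysis.

```sql
-- Count how many points fall in each cluster.
-- `customers` is the same illustrative table used in the example above.
SELECT
    km.cluster_no,
    count(*) AS num_points
FROM
    cdb_crankshaft.CDB_KMeans('SELECT * FROM customers', 6) AS km
GROUP BY
    km.cluster_no
ORDER BY
    km.cluster_no
```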

### CDB_WeightedMean(subquery text, weight_column text, category_column text)

Computes the weighted centroid of each cluster (category), using the values of a weight column as the weights.

#### Arguments

| Name | Type | Description |
|------|------|-------------|
| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must include the geometry column as well as the columns named in `weight_column` and `category_column` |
| weight\_column | TEXT | The name of the column to use as a weight |
| category\_column | TEXT | The name of the column to use as a category |

#### Returns

A table with the following columns.

| Column Name | Type | Description |
|-------------|------|-------------|
| the\_geom | GEOMETRY | A point for the weighted cluster center |
| class | INTEGER | The cluster class |

#### Example Usage

```sql
SELECT
    ST_Transform(km.the_geom, 3857) As the_geom_webmercator,
    km.class
FROM
    cdb_crankshaft.CDB_WeightedMean(
        'SELECT * FROM customers',
        'customer_value',
        'cluster_no') As km
```
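
To make the idea of a weighted centroid concrete, the following plain PostGIS query is a rough sketch of the computation for point data. It illustrates the concept and is not necessarily how `CDB_WeightedMean` is implemented. It assumes that `customers` holds point geometries in SRID 4326 and already has a `cluster_no` column (for example, saved from a previous `CDB_KMeans` run).

```sql
-- Weighted centroid per cluster: each coordinate is averaged using
-- customer_value as the weight. NULLIF avoids division by zero when
-- the weights of a cluster sum to zero.
SELECT
    cluster_no AS class,
    ST_SetSRID(
        ST_MakePoint(
            sum(ST_X(the_geom) * customer_value) / NULLIF(sum(customer_value), 0),
            sum(ST_Y(the_geom) * customer_value) / NULLIF(sum(customer_value), 0)
        ),
        4326  -- assumed SRID of the input points
    ) AS the_geom
FROM
    customers
WHERE
    the_geom IS NOT NULL
GROUP BY
    cluster_no
```

Points with larger weights pull the cluster center towards them, so the result can differ noticeably from the unweighted centroid.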

### CDB_KMeansNonspatial(subquery text, colnames text[], no_clusters int)

K-means clustering classifies the rows of your dataset into `no_clusters` clusters by finding the centers (means) of the variables in `colnames` and assigning each row to its nearest center. This method partitions the space into distinct Voronoi cells.

As a standard machine learning method, k-means clustering is an unsupervised learning technique that finds the natural clustering of values. For instance, it is useful for finding subgroups in census data leading to demographic segmentation.

#### Arguments

| Name | Type | Description |
|------|------|-------------|
| query | TEXT | SQL query to expose the data to be used in the analysis (e.g., `SELECT * FROM iris_data`). It should contain at least the columns specified in `colnames` and the column named by `id_col`. |
| colnames | TEXT[] | Array of columns to be used in the analysis (e.g., `Array['petal_width', 'sepal_length', 'petal_length']`). |
| no\_clusters | INTEGER | Number of clusters for the classification of the data |
| id\_col (optional) | TEXT | The id column (default: 'cartodb_id') for identifying rows |
| standardize (optional) | BOOLEAN | Setting this to true (default) standardizes the data so that each variable has a mean of zero and a standard deviation of 1 |
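
Standardization matters because k-means relies on distances, so a variable with a large numeric range would otherwise dominate the clustering. As a rough illustration of what standardizing a single column means, the plain PostgreSQL query below computes z-scores with window functions. The `customers` table and `customer_value` column are illustrative, and using the population standard deviation (`stddev_pop`) is an assumption here; this document does not state which variant the function uses.

```sql
-- Rescale customer_value to mean 0 and standard deviation 1.
-- NULLIF guards against division by zero for a constant column.
SELECT
    cartodb_id,
    (customer_value - avg(customer_value) OVER ())
        / NULLIF(stddev_pop(customer_value) OVER (), 0) AS customer_value_z
FROM
    customers
```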

#### Returns

A table with the following columns.

| Column | Type | Description |
|--------|------|-------------|
| cluster_label | TEXT | Label of the cluster that a row belongs to, a number from 0 to `no_clusters - 1`. |
| cluster_center | JSON | Center of the cluster that a row belongs to. The keys of the JSON object are the `colnames`, with values that are the center of the respective cluster |
| silhouettes | NUMERIC | [Silhouette score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score) of the cluster label |
| inertia | NUMERIC | Sum of squared distances of samples to their closest cluster center |
| rowid | BIGINT | id of the original row for associating back with the original data |

#### Example Usage

```sql
SELECT
    customers.*,
    km.cluster_label,
    km.cluster_center,
    km.silhouettes
FROM
    cdb_crankshaft.CDB_KMeansNonspatial(
        'SELECT * FROM customers',
        Array['customer_value', 'avg_amt_spent', 'home_median_income'],
        7) As km,
    customers
WHERE
    customers.cartodb_id = km.rowid
```
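
To judge the quality of a clustering, it can help to summarize the output per cluster. The sketch below uses the same illustrative `customers` table and columns as the example above, and reports the size and the average silhouette value of each cluster. Values closer to 1 suggest better-separated clusters.

```sql
-- Per-cluster summary: number of rows and mean silhouette value.
SELECT
    km.cluster_label,
    count(*) AS num_rows,
    avg(km.silhouettes) AS avg_silhouette
FROM
    cdb_crankshaft.CDB_KMeansNonspatial(
        'SELECT * FROM customers',
        Array['customer_value', 'avg_amt_spent', 'home_median_income'],
        7) AS km
GROUP BY
    km.cluster_label
ORDER BY
    km.cluster_label
```

Running the same summary for several values of `no_clusters` is one way to compare candidate cluster counts.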

### Resources

-   Read more in [scikit-learn's documentation](http://scikit-learn.org/stable/modules/clustering.html#k-means)
-   [K-means basics](https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials)