crankshaft/doc/12_segmentation.md
2018-01-08 16:30:03 -05:00

4.6 KiB

Segmentation Functions

CDB_CreateAndPredictSegment(query TEXT, variable_name TEXT, target_query TEXT)

This function trains a Gradient Boosting model to attempt to predict the target data and then generates predictions for new data.

Arguments

Name Type Description
query TEXT The input query to train the algorithm, which should have both the variable of interest and the features that will be used to predict it
variable_name TEXT Specify the variable in the query to predict, all other columns are assumed to be features
target_table TEXT The query which returns the cartodb_id and features for the rows your would like to predict the target variable for
n_estimators (optional) INTEGER DEFAULT 1200 Number of estimators to be used. Values should be between 1 and x.
max_depth (optional) INTEGER DEFAULT 3 Max tree depth. Values should be between 1 and n.
subsample (optional) DOUBLE PRECISION DEFAULT 0.5 Subsample parameter for GradientBooster. Values should be within the range 0 to 1.
learning_rate (optional) DOUBLE PRECISION DEFAULT 0.01 Learning rate for the GradientBooster. Values should be between 0 and 1 (??)
min_samples_leaf (optional) INTEGER DEFAULT 1 Minimum samples to use per leaf. Values should range from x to y

Returns

A table with the following columns.

Column Name Type Description
cartodb_id INTEGER The CartoDB id of the row in the target_query
prediction NUMERIC The predicted value of the variable of interest
accuracy NUMERIC The mean squared accuracy of the model.

Example Usage

SELECT * from cdb_crankshaft.CDB_CreateAndPredictSegment(
'SELECT agg, median_rent::numeric, male_pop::numeric, female_pop::numeric FROM late_night_agg',
'agg',
'SELECT row_number() OVER () As cartodb_id, median_rent, male_pop, female_pop FROM ml_learning_ny');

CDB_CreateAndPredictSegment(target numeric[], train_features numeric[], prediction_features numeric[], prediction_ids numeric[])

This function trains a Gradient Boosting model to attempt to predict the target data and then generates predictions for new data.

Arguments

Name Type Description
target numeric[] An array of target values of the variable you want to predict
train_features numeric[] 1D array of length n features * n_rows + 1 with the first entry in the array being the number of features in each row. These are the features the model will be trained on. CDB_Crankshaft.CDB_pyAgg(Array[feature1, feature2, feature3]::numeric[]) can be used to construct this.
prediction_features numeric[] 1D array of length nfeatures* n_rows_ + 1 with the first entry in the array being the number of features in each row. These are the features that will be used to predict the target variable CDB_Crankshaft.CDB_pyAgg(Array[feature1, feature2, feature3]::numeric[]) can be used to construct this.
prediction_ids numeric[] 1D array of length n_rows with the ids that can use used to re-join the data with inputs
n_estimators (optional) INTEGER DEFAULT 1200 Number of estimators to be used
max_depth (optional) INTEGER DEFAULT 3 Max tree depth
subsample (optional) DOUBLE PRECISION DEFAULT 0.5 Subsample parameter for GradientBooster
learning_rate (optional) DOUBLE PRECISION DEFAULT 0.01 Learning rate for the GradientBooster
min_samples_leaf (optional) INTEGER DEFAULT 1 Minimum samples to use per leaf

Returns

A table with the following columns.

Column Name Type Description
cartodb_id INTEGER The CartoDB id of the row in the target_query
prediction NUMERIC The predicted value of the variable of interest
accuracy NUMERIC The mean squared accuracy of the model.

Example Usage

WITH training As (
    SELECT array_agg(agg) As target,
           cdb_crankshaft.CDB_PyAgg(Array[median_rent, male_pop, female_pop]::Numeric[]) As features
    FROM late_night_agg),
target AS (
    SELECT cdb_crankshaft.CDB_PyAgg(Array[median_rent, male_pop, female_pop]::Numeric[]) As features,
     array_agg(cartodb_id) As cartodb_ids FROM late_night_agg)

SELECT cdb_crankshaft.CDB_CreateAndPredictSegment(training.target, training.features, target.features, target.cartodb_ids)
FROM training, target;