Merge pull request #202 from CartoDB/iss201-add-lag-to-moran-i

creates new moran's i functions which expose spatial lag
2018-03-12 11:21:14 -04:00 · 2018-03-12 11:21:14 -04:00 · 7cf84133b8
commit 7cf84133b8
parent b547d35882 b86384bb94
7 changed files with 579 additions and 46 deletions
--- a/NEWS.md
+++ b/NEWS.md
@ -1,3 +1,7 @@
 0.8.0 (yyyy-mm-dd)
 ------------------
 * Adds `CDB_MoransILocal*` functions that return spatial lag [#202](https://github.com/CartoDB/crankshaft/pull/202)
 0.7.0 (2018-02-23)
 ------------------
 * Updated Moran and Markov documentation [#179](https://github.com/CartoDB/crankshaft/pull/179) [#155](https://github.com/CartoDB/crankshaft/pull/155)
--- a/doc/02_moran.md
+++ b/doc/02_moran.md
@ -1,4 +1,6 @@
-## Areas of Interest Functions
+## Moran's I - Spatial Autocorrelation
 Note: these functions are replacing the functions in the _Areas of Interest_ family (still documented below). `CDB_MoransILocal` and `CDB_MoransILocalRate` perform the same analysis as their `CDB_AreasOfInterest*` counterparts but return spatial lag information, which is needed for creating the Moran's I scatter plot. It recommended to use the `CDB_MoransILocal*` variants instead as they will be maintained and improved going foward.
 A family of analyses to uncover groupings of areas with consistently high or low values (clusters) and smaller areas with values unlike those around them (outliers). A cluster is labeled by an 'HH' (high value compared to the entire dataset in an area with other high values), or its opposite 'LL'. An outlier is labeled by an 'LH' (low value surrounded by high values) or an 'HL' (the opposite). Each cluster and outlier classification has an associated p-value, a measure of how significant the pattern of highs and lows is compared to a random distribution.
@ -9,7 +11,108 @@ These functions have two forms: local and global. The local versions classify ev
 *   Rows with null values will be omitted from this analysis. To ensure they are added to the analysis, fill the null-valued cells with an appropriate value such as the mean of a column, the mean of the most recent two time steps, or use a `LEFT JOIN` to get null outputs from the analysis.
 *   Input query can only accept tables (datasets) in the users database account. Common table expressions (CTEs) do not work as an input unless specified within the `subquery` argument.
-### CDB_AreasOfInterestLocal(subquery text, column_name text)
+### CDB_MoransILocal(subquery text, column_name text)
 This function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. The classification happens through an autocorrelation statistic called Local Moran's I.
 #### Arguments
 | Name | Type | Description |
 |------|------|-------------|
 | subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must have the geometry column name `the_geom` and id column name `cartodb_id` unless otherwise specified in the input arguments |
 | column_name | TEXT | Name of column (e.g., should be `'interesting_value'` instead of `interesting_value` without single quotes) used for the analysis. |
 | weight type (optional) | TEXT | Type of weight to use when finding neighbors. Currently available options are 'knn' (default) and 'queen'. Read more about weight types in [PySAL's weights documentation](https://pysal.readthedocs.io/en/v1.11.0/users/tutorials/weights.html). |
 | num_ngbrs (optional) | INT | Number of neighbors if using k-nearest neighbors weight type. Defaults to 5. |
 | permutations (optional) | INT | Number of permutations to check against a random arrangement of the values in `column_name`. This influences the accuracy of the output field `significance`. Defaults to 99. |
 | geom_col (optional) | TEXT | The column name for the geometries. Defaults to `'the_geom'` |
 | id_col (optional) | TEXT | The column name for the unique ID of each geometry/value pair. Defaults to `'cartodb_id'`. |
 #### Returns
 A table with the following columns.
 | Column Name | Type | Description |
 |-------------|------|-------------|
 | quads | TEXT | Classification of geometry. Result is one of 'HH' (a high value with neighbors high on average), 'LL' (opposite of 'HH'), 'HL' (a high value surrounded by lows on average), and 'LH' (opposite of 'HL'). Null values are returned when nulls exist in the original data. |
 | significance | NUMERIC | The statistical significance (from 0 to 1) of a cluster or outlier classification. Lower numbers are more significant. |
 | spatial\_lag | NUMERIC | The 'average' of the neighbors of the value in this row. The average is calculated from it's neighborhood -- defined by `weight_type`. |
 | spatial\_lag\_std | NUMERIC | The standardized version of `spatial_lag` -- that is, centered on the mean and divided by the standard deviation. Useful as the y-axis in a Moran's I scatter plot. |
 | orig\_val | NUMERIC | Values from `column_name`. |
 | orig\_val\_std | NUMERIC | Values from `column_name` but centered on the mean and divided by the standard devation. Useful as the x-axis in Moran's I scatter plots. |
 | moran\_stat | NUMERIC | Value of Moran's I (spatial autocorrelation measure) for the geometry with id of `rowid` |
 | rowid | INT | Row id of the values which correspond to the input rows. |
 #### Example Usage
 ```sql
 SELECT
  c.the_geom,
  m.quads,
  m.significance,
  c.num_cyclists_per_total_population
 FROM
  cdb_crankshaft.CDB_MoransILocal(
    'SELECT * FROM commute_data'
    'num_cyclists_per_total_population') As m
 JOIN commute_data As c
 ON c.cartodb_id = m.rowid;
 ```
 ### CDB_MoransILocalRate(subquery text, numerator text, denominator text)
 Just like `CDB_MoransILocal`, this function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. This function differs in that it calculates the classifications based on input `numerator` and `denominator` columns for finding the areas where there are clusters and outliers for the resulting rate of those two values.
 #### Arguments
 | Name | Type | Description |
 |------|------|-------------|
 | subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must have the geometry column name `the_geom` and id column name `cartodb_id` unless otherwise specified in the input arguments |
 | numerator | TEXT | Name of the numerator for forming a rate to be used in analysis. |
 | denominator | TEXT | Name of the denominator for forming a rate to be used in analysis. |
 | weight type (optional) | TEXT | Type of weight to use when finding neighbors. Currently available options are 'knn' (default) and 'queen'. Read more about weight types in [PySAL's weights documentation](https://pysal.readthedocs.io/en/v1.11.0/users/tutorials/weights.html). |
 | num_ngbrs (optional) | INT | Number of neighbors if using k-nearest neighbors weight type. Defaults to 5. |
 | permutations (optional) | INT | Number of permutations to check against a random arrangement of the values in `column_name`. This influences the accuracy of the output field `significance`. Defaults to 99. |
 | geom_col (optional) | TEXT | The column name for the geometries. Defaults to `the_geom` |
 | id_col (optional) | TEXT | The column name for the unique ID of each geometry/value pair. Defaults to `cartodb_id`. |
 #### Returns
 A table with the following columns.
 | Column Name | Type | Description |
 |-------------|------|-------------|
 | quads | TEXT | Classification of geometry. Result is one of 'HH' (a high value with neighbors high on average), 'LL' (opposite of 'HH'), 'HL' (a high value surrounded by lows on average), and 'LH' (opposite of 'HL'). Null values are returned when nulls exist in the original data. |
 | significance | NUMERIC | The statistical significance (from 0 to 1) of a cluster or outlier classification. Lower numbers are more significant. |
 | spatial\_lag | NUMERIC | The 'average' of the neighbors of the value in this row. The average is calculated from it's neighborhood -- defined by `weight_type`. |
 | spatial\_lag\_std | NUMERIC | The standardized version of `spatial_lag` -- that is, centered on the mean and divided by the standard deviation. |
 | orig\_val | NUMERIC | Standardized rate (centered on the mean and normalized by the standard deviation) calculated from `numerator` and `denominator`. This is calculated by [Assuncao Rate](http://pysal.readthedocs.io/en/latest/library/esda/smoothing.html?highlight=assuncao#pysal.esda.smoothing.assuncao_rate) in the PySAL library. |
 | orig\_val\_std | NUMERIC | Values from `column_name` but centered on the mean and divided by the standard devation. Useful as the x-axis in Moran's I scatter plots. |
 | moran\_stat | NUMERIC | Value of Moran's I (spatial autocorrelation measure) for the geometry with id of `rowid` |
 | rowid | INT | Row id of the values which correspond to the input rows. |
 A table with the following columns. |
 #### Example Usage
 ```sql
 SELECT
  c.the_geom,
  m.quads,
  m.significance,
  c.cyclists_per_total_population
 FROM
    cdb_crankshaft.CDB_MoransILocalRate(
        'SELECT * FROM commute_data'
        'num_cyclists',
        'total_population') As m
 JOIN commute_data As c
 ON c.cartodb_id = m.rowid;
 ```
 ### CDB_AreasOfInterestLocal(subquery text, column_name text) (deprecated)
 This function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. The classification happens through an autocorrelation statistic called Local Moran's I.
@ -55,7 +158,7 @@ JOIN commute_data As c
 ON c.cartodb_id = aoi.rowid;
 ```
-### CDB_AreasOfInterestGlobal(subquery text, column_name text)
+### CDB_AreasOfInterestGlobal(subquery text, column_name text) (deprecated)
 This function identifies the extent to which geometries cluster (the groupings of geometries with similarly high or low values relative to the mean) or form outliers (areas where geometries have values opposite of their neighbors). The output of this function gives values between -1 and 1 as well as a significance of that classification. Values close to 0 mean that there is little to no distribution of values as compared to what one would see in a randomly distributed collection of geometries and values.
@ -91,7 +194,7 @@ FROM
        'num_cyclists_per_total_population')
 ```
-### CDB_AreasOfInterestLocalRate(subquery text, numerator_column text, denominator_column text)
+### CDB_AreasOfInterestLocalRate(subquery text, numerator_column text, denominator_column text) (deprecated)
 Just like `CDB_AreasOfInterestLocal`, this function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. This function differs in that it calculates the classifications based on input `numerator` and `denominator` columns for finding the areas where there are clusters and outliers for the resulting rate of those two values.
@ -138,7 +241,7 @@ JOIN commute_data As c
 ON c.cartodb_id = aoi.rowid;
 ```
-### CDB_AreasOfInterestGlobalRate(subquery text, column_name text)
+### CDB_AreasOfInterestGlobalRate(subquery text, column_name text) (deprecated)
 This function identifies the extent to which geometries cluster (the groupings of geometries with similarly high or low values relative to the mean) or form outliers (areas where geometries have values opposite of their neighbors). The output of this function gives values between -1 and 1 as well as a significance of that classification. Values close to 0 mean that there is little to no distribution of values as compared to what one would see in a randomly distributed collection of geometries and values.
@ -178,7 +281,7 @@ FROM
 ## Hotspot, Coldspot, and Outlier Functions
-These functions are convenience functions for extracting only information that you are interested in exposing based on the outputs of the `CDB_AreasOfInterest` functions. For instance, you can use `CDB_GetSpatialHotspots` to output only the classifications of `HH` and `HL`.
+These functions are convenience functions for extracting only information that you are interested in exposing based on the outputs of the `CDB_MoransI*` functions. For instance, you can use `CDB_GetSpatialHotspots` to output only the classifications of `HH` and `HL`.
 ### Non-rate functions
--- a/src/pg/sql/10_moran.sql
+++ b/src/pg/sql/10_moran.sql
@ -17,7 +17,7 @@ AS $$
                           num_ngbrs, permutations, geom_col, id_col)
 $$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
-- Moran's I Local (internal function)
+-- Moran's I Local (internal function) - DEPRECATED
 CREATE OR REPLACE FUNCTION
  _CDB_AreasOfInterestLocal(
      subquery TEXT,
@ -27,16 +27,82 @@ CREATE OR REPLACE FUNCTION
      permutations INT,
      geom_col TEXT,
      id_col TEXT)
-RETURNS TABLE (moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
+RETURNS TABLE (
    moran NUMERIC,
    quads TEXT,
    significance NUMERIC,
    rowid INT,
    vals NUMERIC)
 AS $$
  from crankshaft.clustering import Moran
  moran = Moran()
-  # TODO: use named parameters or a dictionary
+  result = moran.local_stat(subquery, column_name, w_type,
  return moran.local_stat(subquery, column_name, w_type,
                            num_ngbrs, permutations, geom_col, id_col)
  # remove spatial lag
  return [(r[6], r[0], r[1], r[7], r[5]) for r in result]
 $$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
 -- Moran's I Local (internal function)
 CREATE OR REPLACE FUNCTION
  _CDB_MoransILocal(
      subquery TEXT,
      column_name TEXT,
      w_type TEXT,
      num_ngbrs INT,
      permutations INT,
      geom_col TEXT,
      id_col TEXT)
 RETURNS TABLE (
    quads TEXT,
    significance NUMERIC,
    spatial_lag NUMERIC,
    spatial_lag_std NUMERIC,
    orig_val NUMERIC,
    orig_val_std NUMERIC,
    moran_stat NUMERIC,
    rowid INT)
 AS $$
 from crankshaft.clustering import Moran
 moran = Moran()
 return moran.local_stat(subquery, column_name, w_type,
                        num_ngbrs, permutations, geom_col, id_col)
 $$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
 -- Moran's I Local (public-facing function)
 --  Replaces CDB_AreasOfInterestLocal
 CREATE OR REPLACE FUNCTION
  CDB_MoransILocal(
    subquery TEXT,
    column_name TEXT,
    w_type TEXT DEFAULT 'knn',
    num_ngbrs INT DEFAULT 5,
    permutations INT DEFAULT 99,
    geom_col TEXT DEFAULT 'the_geom',
    id_col TEXT DEFAULT 'cartodb_id')
 RETURNS TABLE (
    quads TEXT,
    significance NUMERIC,
    spatial_lag NUMERIC,
    spatial_lag_std NUMERIC,
    orig_val NUMERIC,
    orig_val_std NUMERIC,
    moran_stat NUMERIC,
    rowid INT)
 AS $$
  SELECT
    quads, significance, spatial_lag, spatial_lag_std,
    orig_val, orig_val_std, moran_stat, rowid
  FROM cdb_crankshaft._CDB_MoransILocal(
    subquery, column_name, w_type,
    num_ngbrs, permutations, geom_col, id_col);
 $$ LANGUAGE SQL VOLATILE PARALLEL UNSAFE;
 -- Moran's I Local (public-facing function) - DEPRECATED
 CREATE OR REPLACE FUNCTION
  CDB_AreasOfInterestLocal(
    subquery TEXT,
@ -132,7 +198,7 @@ AS $$
 $$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
-- Moran's I Local Rate (internal function)
+-- Moran's I Local Rate (internal function) - DEPRECATED
 CREATE OR REPLACE FUNCTION
  _CDB_AreasOfInterestLocalRate(
      subquery TEXT,
@ -144,15 +210,22 @@ CREATE OR REPLACE FUNCTION
      geom_col TEXT,
      id_col TEXT)
 RETURNS
-TABLE(moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
+TABLE(
    moran NUMERIC,
    quads TEXT,
    significance NUMERIC,
    rowid INT,
    vals NUMERIC)
 AS $$
  from crankshaft.clustering import Moran
  moran = Moran()
  # TODO: use named parameters or a dictionary
-  return moran.local_rate_stat(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col)
+  result = moran.local_rate_stat(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col)
  # remove spatial lag
  return [(r[6], r[0], r[1], r[7], r[4]) for r in result]
 $$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
-- Moran's I Local Rate (public-facing function)
+-- Moran's I Local Rate (public-facing function) - DEPRECATED
 CREATE OR REPLACE FUNCTION
  CDB_AreasOfInterestLocalRate(
      subquery TEXT,
@ -172,6 +245,75 @@ AS $$
 $$ LANGUAGE SQL VOLATILE PARALLEL UNSAFE;
 -- Internal function
 CREATE OR REPLACE FUNCTION
  _CDB_MoransILocalRate(
      subquery TEXT,
      numerator TEXT,
      denominator TEXT,
      w_type TEXT,
      num_ngbrs INT,
      permutations INT,
      geom_col TEXT,
      id_col TEXT)
 RETURNS
 TABLE(
    quads TEXT,
    significance NUMERIC,
    spatial_lag NUMERIC,
    spatial_lag_std NUMERIC,
    orig_val NUMERIC,
    orig_val_std NUMERIC,
    moran_stat NUMERIC,
    rowid INT)
 AS $$
 from crankshaft.clustering import Moran
 moran = Moran()
 return moran.local_rate_stat(
    subquery,
    numerator,
    denominator,
    w_type,
    num_ngbrs,
    permutations,
    geom_col,
    id_col
 )
 $$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
 -- Moran's I Rate
 -- Replaces CDB_AreasOfInterestLocalRate
 CREATE OR REPLACE FUNCTION
  CDB_MoransILocalRate(
      subquery TEXT,
      numerator TEXT,
      denominator TEXT,
      w_type TEXT DEFAULT 'knn',
      num_ngbrs INT DEFAULT 5,
      permutations INT DEFAULT 99,
      geom_col TEXT DEFAULT 'the_geom',
      id_col TEXT DEFAULT 'cartodb_id')
 RETURNS
 TABLE(
    quads TEXT,
    significance NUMERIC,
    spatial_lag NUMERIC,
    spatial_lag_std NUMERIC,
    orig_val NUMERIC,
    orig_val_std NUMERIC,
    moran_stat NUMERIC,
    rowid INT)
 AS $$
 SELECT
  quads, significance, spatial_lag, spatial_lag_std,
  orig_val, orig_val_std, moran_stat, rowid
 FROM cdb_crankshaft._CDB_MoransILocalRate(
  subquery, numerator, denominator, w_type,
  num_ngbrs, permutations, geom_col, id_col);
 $$ LANGUAGE SQL VOLATILE PARALLEL UNSAFE;
 -- Moran's I Local Rate only for HH and HL (public-facing function)
 CREATE OR REPLACE FUNCTION
  CDB_GetSpatialHotspotsRate(
--- a/src/pg/test/expected/02_moran_test.out
+++ b/src/pg/test/expected/02_moran_test.out
@ -68,6 +68,63 @@ code|quads
 (52 rows)
 _cdb_random_seeds
 (1 row)
 code|quads|diff_orig|expected|moran_stat_not_null|significance_not_null|value_comparison
 01|HH|t|t|t|t|t
 02|HL|t|t|t|t|t
 03|LL|t|t|t|t|t
 04|LL|t|t|t|t|t
 05|LH|t|t|t|t|t
 06|LL|t|t|t|t|t
 07|HH|t|t|t|t|t
 08|HH|t|t|t|t|t
 09|HH|t|t|t|t|t
 10|LL|t|t|t|t|t
 11|LL|t|t|t|t|t
 12|LL|t|t|t|t|t
 13|HL|t|t|t|t|t
 14|LL|t|t|t|t|t
 15|LL|t|t|t|t|t
 16|HH|t|t|t|t|t
 17|HH|t|t|t|t|t
 18|LL|t|t|t|t|t
 19|HH|t|t|t|t|t
 20|HH|t|t|t|t|t
 21|LL|t|t|t|t|t
 22|HH|t|t|t|t|t
 23|LL|t|t|t|t|t
 24|LL|t|t|t|t|t
 25|HH|t|t|t|t|t
 26|HH|t|t|t|t|t
 27|LL|t|t|t|t|t
 28|HH|t|t|t|t|t
 29|LL|t|t|t|t|t
 30|LL|t|t|t|t|t
 31|HH|t|t|t|t|t
 32|LL|t|t|t|t|t
 33|HL|t|t|t|t|t
 34|LH|t|t|t|t|t
 35|LL|t|t|t|t|t
 36|LL|t|t|t|t|t
 37|HL|t|t|t|t|t
 38|HL|t|t|t|t|t
 39|HH|t|t|t|t|t
 40|HH|t|t|t|t|t
 41|HL|t|t|t|t|t
 42|LH|t|t|t|t|t
 43|LH|t|t|t|t|t
 44|LL|t|t|t|t|t
 45|LH|t|t|t|t|t
 46|LL|t|t|t|t|t
 47|LL|t|t|t|t|t
 48|HH|t|t|t|t|t
 49|LH|t|t|t|t|t
 50|HH|t|t|t|t|t
 51|LL|t|t|t|t|t
 52|LL|t|t|t|t|t
 (52 rows)
 _cdb_random_seeds
 (1 row)
 code|quads
 01|HH
@ -204,6 +261,63 @@ code|quads
 (52 rows)
 _cdb_random_seeds
 (1 row)
 code|quads|diff_orig|expected|moran_stat_not_null|significance_not_null
 01|HH|t|t|t|t
 02|HL|t|t|t|t
 03|LL|t|t|t|t
 04|LL|t|t|t|t
 05|LH|t|t|t|t
 06|LL|t|t|t|t
 07|HH|t|t|t|t
 08|HH|t|t|t|t
 09|HH|t|t|t|t
 10|LL|t|t|t|t
 11|LL|t|t|t|t
 12|LL|t|t|t|t
 13|HL|t|t|t|t
 14|LL|t|t|t|t
 15|LL|t|t|t|t
 16|HH|t|t|t|t
 17|HH|t|t|t|t
 18|LL|t|t|t|t
 19|HH|t|t|t|t
 20|HH|t|t|t|t
 21|LL|t|t|t|t
 22|HH|t|t|t|t
 23|LL|t|t|t|t
 24|LL|t|t|t|t
 25|HH|t|t|t|t
 26|HH|t|t|t|t
 27|LL|t|t|t|t
 28|HH|t|t|t|t
 29|LL|t|t|t|t
 30|LL|t|t|t|t
 31|HH|t|t|t|t
 32|LL|t|t|t|t
 33|HL|t|t|t|t
 34|LH|t|t|t|t
 35|LL|t|t|t|t
 36|LL|t|t|t|t
 37|HL|t|t|t|t
 38|HL|t|t|t|t
 39|HH|t|t|t|t
 40|HH|t|t|t|t
 41|HL|t|t|t|t
 42|LH|t|t|t|t
 43|LH|t|t|t|t
 44|LL|t|t|t|t
 45|LH|t|t|t|t
 46|LL|t|t|t|t
 47|LL|t|t|t|t
 48|HH|t|t|t|t
 49|LH|t|t|t|t
 50|HH|t|t|t|t
 51|LL|t|t|t|t
 52|LL|t|t|t|t
 (52 rows)
 _cdb_random_seeds
 (1 row)
 code|quads
 01|HH
--- a/src/pg/test/sql/02_moran_test.sql
+++ b/src/pg/test/sql/02_moran_test.sql
@ -24,6 +24,25 @@ SELECT ppoints.code, m.quads
 SELECT cdb_crankshaft._cdb_random_seeds(1234);
 -- Moran's I local
 SELECT
  ppoints.code, m.quads,
  abs(avg(m.orig_val_std) OVER ()) < 1e-6 as diff_orig,
  CASE WHEN m.quads = 'HL' THEN m.orig_val_std > m.spatial_lag_std
       WHEN m.quads = 'HH' THEN m.orig_val_std >= 0 and m.spatial_lag_std >= 0
       WHEN m.quads = 'LH' THEN m.orig_val_std < m.spatial_lag_std
       WHEN m.quads = 'LL' THEN m.orig_val_std <= 0 and m.spatial_lag_std <= 0
       ELSE null END as expected,
  moran_stat is not null moran_stat_not_null,
  significance >= 0.001 significance_not_null, -- greater than 1/1000 (default)
  abs(m.orig_val - ppoints.value) <= 1e-6 as value_comparison
  FROM ppoints
  JOIN cdb_crankshaft.CDB_MoransILocal('SELECT * FROM ppoints', 'value') m
    ON ppoints.cartodb_id = m.rowid
  ORDER BY ppoints.code;
 SELECT cdb_crankshaft._cdb_random_seeds(1234);
 -- Spatial Hotspots
 SELECT ppoints.code, m.quads
  FROM ppoints
@ -61,6 +80,24 @@ SELECT ppoints2.code, m.quads
 SELECT cdb_crankshaft._cdb_random_seeds(1234);
 -- Moran's I local rate
 SELECT
  ppoints2.code, m.quads,
  abs(avg(m.orig_val_std) OVER ()) < 1e-6 as diff_orig,
  CASE WHEN m.quads = 'HL' THEN m.orig_val_std > m.spatial_lag_std
       WHEN m.quads = 'HH' THEN m.orig_val_std >= 0 and m.spatial_lag_std >= 0
       WHEN m.quads = 'LH' THEN m.orig_val_std < m.spatial_lag_std
       WHEN m.quads = 'LL' THEN m.orig_val_std <= 0 and m.spatial_lag_std <= 0
       ELSE null END as expected,
  moran_stat is not null moran_stat_not_null,
  significance >= 0.001 significance_not_null -- greater than 1/1000 (default)
  FROM ppoints2
  JOIN cdb_crankshaft.CDB_MoransILocalRate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
    ON ppoints2.cartodb_id = m.rowid
  ORDER BY ppoints2.code;
 SELECT cdb_crankshaft._cdb_random_seeds(1234);
 -- Spatial Hotspots (rate)
 SELECT ppoints2.code, m.quads
  FROM ppoints2
--- a/src/py/crankshaft/crankshaft/clustering/moran.py
+++ b/src/py/crankshaft/crankshaft/clustering/moran.py
@ -1,21 +1,29 @@
 """
 Moran's I geostatistics (global clustering & outliers presence)
 Functionality relies on a combination of `PySAL
 <http://pysal.readthedocs.io/en/latest/>`__ and the data providered provided in
 the class instantiation (which defaults to PostgreSQL's plpy module's `database
 access functions <https://www.postgresql.org/docs/10/static/plpython.html>`__).
 """
 # TODO: Fill in local neighbors which have null/NoneType values with the
 #       average of the their neighborhood
 import pysal as ps
 from collections import OrderedDict
-from crankshaft.analysis_data_provider import AnalysisDataProvider
+import pysal as ps
 # crankshaft module
 import crankshaft.pysal_utils as pu
 from crankshaft.analysis_data_provider import AnalysisDataProvider
 # High level interface ---------------------------------------
 class Moran(object):
    """Class for calculation of Moran's I statistics (global, local, and local
    rate)
    Parameters:
      data_provider (:obj:`AnalysisDataProvider`): Class for fetching data. See
        the `crankshaft.analysis_data_provider` module for more information.
    """
    def __init__(self, data_provider=None):
        if data_provider is None:
            self.data_provider = AnalysisDataProvider()
@ -28,7 +36,26 @@ class Moran(object):
        Moran's I (global)
        Implementation building neighbors with a PostGIS database and Moran's I
         core clusters with PySAL.
-        Andy Eschbacher
+
        Args:
          subquery (str): Query to give access to the data needed. This query
            must give access to ``attr_name``, ``geom_col``, and ``id_col``.
          attr_name (str): Column name of data to analyze
          w_type (str): Type of spatial weight. Must be one of `knn`
            or `queen`. See `PySAL documentation
            <http://pysal.readthedocs.io/en/latest/users/tutorials/weights.html>`__
            for more information.
          num_ngbrs (int): If using `knn` for ``w_type``, this
            specifies the number of neighbors to be used to define the spatial
            neighborhoods.
          permutations (int): Number of permutations for performing
            conditional randomization to find the p-value. Higher numbers
            takes a longer time for getting results.
          geom_col (str): Name of the geometry column in the dataset for
            finding the spatial neighborhoods.
          id_col (str): Row index for each value. Usually the database index.
        """
        params = OrderedDict([("id_col", id_col),
                              ("attr1", attr_name),
@ -53,8 +80,38 @@ class Moran(object):
    def local_stat(self, subquery, attr,
                   w_type, num_ngbrs, permutations, geom_col, id_col):
        """
-        Moran's I implementation for PL/Python
+        Moran's I (local)
-        Andy Eschbacher
+
        Args:
          subquery (str): Query to give access to the data needed. This query
            must give access to ``attr_name``, ``geom_col``, and ``id_col``.
          attr (str): Column name of data to analyze
          w_type (str): Type of spatial weight. Must be one of `knn`
            or `queen`. See `PySAL documentation
            <http://pysal.readthedocs.io/en/latest/users/tutorials/weights.html>`__
            for more information.
          num_ngbrs (int): If using `knn` for ``w_type``, this
            specifies the number of neighbors to be used to define the spatial
            neighborhoods.
          permutations (int): Number of permutations for performing
            conditional randomization to find the p-value. Higher numbers
            takes a longer time for getting results.
          geom_col (str): Name of the geometry column in the dataset for
            finding the spatial neighborhoods.
          id_col (str): Row index for each value. Usually the database index.
        Returns:
          list of tuples: Where each tuple consists of the following values:
            - quadrants classification (one of `HH`, `HL`, `LL`, or `LH`)
            - p-value
            - spatial lag
            - standardized spatial lag (centered on the mean, normalized by the
              standard deviation)
            - original value
            - standardized value
            - Moran's I statistic
            - original row index
        """
        # geometries with attributes that are null are ignored
@ -78,13 +135,45 @@ class Moran(object):
        # find quadrants for each geometry
        quads = quad_position(lisa.q)
-        return zip(lisa.Is, quads, lisa.p_sim, weight.id_order, lisa.y)
+        # calculate spatial lag
        lag = ps.weights.spatial_lag.lag_spatial(weight, lisa.y)
        lag_std = ps.weights.spatial_lag.lag_spatial(weight, lisa.z)
        return zip(
            quads,
            lisa.p_sim,
            lag,
            lag_std,
            lisa.y,
            lisa.z,
            lisa.Is,
            weight.id_order
        )
    def global_rate_stat(self, subquery, numerator, denominator,
                         w_type, num_ngbrs, permutations, geom_col, id_col):
        """
        Moran's I Rate (global)
-        Andy Eschbacher
+
        Args:
          subquery (str): Query to give access to the data needed. This query
            must give access to ``attr_name``, ``geom_col``, and ``id_col``.
          numerator (str): Column name of numerator to analyze
          denominator (str): Column name of the denominator
          w_type (str): Type of spatial weight. Must be one of `knn`
            or `queen`. See `PySAL documentation
            <http://pysal.readthedocs.io/en/latest/users/tutorials/weights.html>`__
            for more information.
          num_ngbrs (int): If using `knn` for ``w_type``, this
            specifies the number of neighbors to be used to define the spatial
            neighborhoods.
          permutations (int): Number of permutations for performing
            conditional randomization to find the p-value. Higher numbers
            takes a longer time for getting results.
          geom_col (str): Name of the geometry column in the dataset for
            finding the spatial neighborhoods.
          id_col (str): Row index for each value. Usually the database index.
        """
        params = OrderedDict([("id_col", id_col),
                              ("attr1", numerator),
@ -111,7 +200,38 @@ class Moran(object):
                        w_type, num_ngbrs, permutations, geom_col, id_col):
        """
        Moran's I Local Rate
-            Andy Eschbacher
+
        Args:
          subquery (str): Query to give access to the data needed. This query
            must give access to ``attr_name``, ``geom_col``, and ``id_col``.
          numerator (str): Column name of numerator to analyze
          denominator (str): Column name of the denominator
          w_type (str): Type of spatial weight. Must be one of `knn`
            or `queen`. See `PySAL documentation
            <http://pysal.readthedocs.io/en/latest/users/tutorials/weights.html>`__
            for more information.
          num_ngbrs (int): If using `knn` for ``w_type``, this
            specifies the number of neighbors to be used to define the spatial
            neighborhoods.
          permutations (int): Number of permutations for performing
            conditional randomization to find the p-value. Higher numbers
            takes a longer time for getting results.
          geom_col (str): Name of the geometry column in the dataset for
            finding the spatial neighborhoods.
          id_col (str): Row index for each value. Usually the database index.
        Returns:
          list of tuples: Where each tuple consists of the following values:
            - quadrants classification (one of `HH`, `HL`, `LL`, or `LH`)
            - p-value
            - spatial lag
            - standardized spatial lag (centered on the mean, normalized by the
              standard deviation)
            - original value (roughly numerator divided by denominator)
            - standardized value
            - Moran's I statistic
            - original row index
        """
        # geometries with values that are null are ignored
        # resulting in a collection of not as near neighbors
@ -138,7 +258,20 @@ class Moran(object):
        # find quadrants for each geometry
        quads = quad_position(lisa.q)
-        return zip(lisa.Is, quads, lisa.p_sim, weight.id_order, lisa.y)
+        # spatial lag
        lag = ps.weights.spatial_lag.lag_spatial(weight, lisa.y)
        lag_std = ps.weights.spatial_lag.lag_spatial(weight, lisa.z)
        return zip(
            quads,
            lisa.p_sim,
            lag,
            lag_std,
            lisa.y,
            lisa.z,
            lisa.Is,
            weight.id_order
        )
    def local_bivariate_stat(self, subquery, attr1, attr2,
                             permutations, geom_col, id_col,
@ -179,9 +312,9 @@ def map_quads(coord):
    """
    Map a quadrant number to Moran's I designation
    HH=1, LH=2, LL=3, HL=4
-        Input:
+    Args:
-        @param coord (int): quadrant of a specific measurement
+      coord (int): quadrant of a specific measurement
-        Output:
+    Returns:
      classification (one of 'HH', 'LH', 'LL', or 'HL')
    """
    if coord == 1:
@ -192,17 +325,17 @@ def map_quads(coord):
        return 'LL'
    elif coord == 4:
        return 'HL'
    else:
    return None
 def quad_position(quads):
    """
-        Produce Moran's I classification based of n
+    Map all quads
-        Input:
+
-        @param quads ndarray: an array of quads classified by
+    Args:
      quads (:obj:`numpy.ndarray`): an array of quads classified by
      1-4 (PySAL default)
-        Output:
+    Returns:
-        @param list: an array of quads classied by 'HH', 'LL', etc.
+      list: an array of quads classied by 'HH', 'LL', etc.
    """
    return [map_quads(q) for q in quads]
--- a/src/py/crankshaft/test/test_clustering_moran.py
+++ b/src/py/crankshaft/test/test_clustering_moran.py
@ -71,10 +71,10 @@ class MoranTest(unittest.TestCase):
        random_seeds.set_random_seeds(1234)
        result = moran.local_stat('subquery', 'value',
                                  'knn', 5, 99, 'the_geom', 'cartodb_id')
-        result = [(row[0], row[1]) for row in result]
+        result = [(row[0], row[6]) for row in result]
        zipped_values = zip(result, self.moran_data)
-        for ([res_val, res_quad], [exp_val, exp_quad]) in zipped_values:
+        for ([res_quad, res_val], [exp_val, exp_quad]) in zipped_values:
            self.assertAlmostEqual(res_val, exp_val)
            self.assertEqual(res_quad, exp_quad)
@ -89,11 +89,11 @@ class MoranTest(unittest.TestCase):
        moran = Moran(FakeDataProvider(data))
        result = moran.local_rate_stat('subquery', 'numerator', 'denominator',
                                       'knn', 5, 99, 'the_geom', 'cartodb_id')
-        result = [(row[0], row[1]) for row in result]
+        result = [(row[0], row[6]) for row in result]
        zipped_values = zip(result, self.moran_data)
-        for ([res_val, res_quad], [exp_val, exp_quad]) in zipped_values:
+        for ([res_quad, res_val], [exp_val, exp_quad]) in zipped_values:
            self.assertAlmostEqual(res_val, exp_val)
    def test_moran(self):