Merge pull request #202 from CartoDB/iss201-add-lag-to-moran-i

creates new moran's i functions which expose spatial lag
This commit is contained in:
Andy Eschbacher 2018-03-12 11:21:14 -04:00 committed by GitHub
commit 7cf84133b8
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
7 changed files with 579 additions and 46 deletions

View File

@ -1,3 +1,7 @@
0.8.0 (yyyy-mm-dd)
------------------
* Adds `CDB_MoransILocal*` functions that return spatial lag [#202](https://github.com/CartoDB/crankshaft/pull/202)
0.7.0 (2018-02-23)
------------------
* Updated Moran and Markov documentation [#179](https://github.com/CartoDB/crankshaft/pull/179) [#155](https://github.com/CartoDB/crankshaft/pull/155)

View File

@ -1,4 +1,6 @@
## Areas of Interest Functions
## Moran's I - Spatial Autocorrelation
Note: these functions are replacing the functions in the _Areas of Interest_ family (still documented below). `CDB_MoransILocal` and `CDB_MoransILocalRate` perform the same analysis as their `CDB_AreasOfInterest*` counterparts but return spatial lag information, which is needed for creating the Moran's I scatter plot. It recommended to use the `CDB_MoransILocal*` variants instead as they will be maintained and improved going foward.
A family of analyses to uncover groupings of areas with consistently high or low values (clusters) and smaller areas with values unlike those around them (outliers). A cluster is labeled by an 'HH' (high value compared to the entire dataset in an area with other high values), or its opposite 'LL'. An outlier is labeled by an 'LH' (low value surrounded by high values) or an 'HL' (the opposite). Each cluster and outlier classification has an associated p-value, a measure of how significant the pattern of highs and lows is compared to a random distribution.
@ -9,7 +11,108 @@ These functions have two forms: local and global. The local versions classify ev
* Rows with null values will be omitted from this analysis. To ensure they are added to the analysis, fill the null-valued cells with an appropriate value such as the mean of a column, the mean of the most recent two time steps, or use a `LEFT JOIN` to get null outputs from the analysis.
* Input query can only accept tables (datasets) in the users database account. Common table expressions (CTEs) do not work as an input unless specified within the `subquery` argument.
### CDB_AreasOfInterestLocal(subquery text, column_name text)
### CDB_MoransILocal(subquery text, column_name text)
This function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. The classification happens through an autocorrelation statistic called Local Moran's I.
#### Arguments
| Name | Type | Description |
|------|------|-------------|
| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must have the geometry column name `the_geom` and id column name `cartodb_id` unless otherwise specified in the input arguments |
| column_name | TEXT | Name of column (e.g., should be `'interesting_value'` instead of `interesting_value` without single quotes) used for the analysis. |
| weight type (optional) | TEXT | Type of weight to use when finding neighbors. Currently available options are 'knn' (default) and 'queen'. Read more about weight types in [PySAL's weights documentation](https://pysal.readthedocs.io/en/v1.11.0/users/tutorials/weights.html). |
| num_ngbrs (optional) | INT | Number of neighbors if using k-nearest neighbors weight type. Defaults to 5. |
| permutations (optional) | INT | Number of permutations to check against a random arrangement of the values in `column_name`. This influences the accuracy of the output field `significance`. Defaults to 99. |
| geom_col (optional) | TEXT | The column name for the geometries. Defaults to `'the_geom'` |
| id_col (optional) | TEXT | The column name for the unique ID of each geometry/value pair. Defaults to `'cartodb_id'`. |
#### Returns
A table with the following columns.
| Column Name | Type | Description |
|-------------|------|-------------|
| quads | TEXT | Classification of geometry. Result is one of 'HH' (a high value with neighbors high on average), 'LL' (opposite of 'HH'), 'HL' (a high value surrounded by lows on average), and 'LH' (opposite of 'HL'). Null values are returned when nulls exist in the original data. |
| significance | NUMERIC | The statistical significance (from 0 to 1) of a cluster or outlier classification. Lower numbers are more significant. |
| spatial\_lag | NUMERIC | The 'average' of the neighbors of the value in this row. The average is calculated from it's neighborhood -- defined by `weight_type`. |
| spatial\_lag\_std | NUMERIC | The standardized version of `spatial_lag` -- that is, centered on the mean and divided by the standard deviation. Useful as the y-axis in a Moran's I scatter plot. |
| orig\_val | NUMERIC | Values from `column_name`. |
| orig\_val\_std | NUMERIC | Values from `column_name` but centered on the mean and divided by the standard devation. Useful as the x-axis in Moran's I scatter plots. |
| moran\_stat | NUMERIC | Value of Moran's I (spatial autocorrelation measure) for the geometry with id of `rowid` |
| rowid | INT | Row id of the values which correspond to the input rows. |
#### Example Usage
```sql
SELECT
c.the_geom,
m.quads,
m.significance,
c.num_cyclists_per_total_population
FROM
cdb_crankshaft.CDB_MoransILocal(
'SELECT * FROM commute_data'
'num_cyclists_per_total_population') As m
JOIN commute_data As c
ON c.cartodb_id = m.rowid;
```
### CDB_MoransILocalRate(subquery text, numerator text, denominator text)
Just like `CDB_MoransILocal`, this function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. This function differs in that it calculates the classifications based on input `numerator` and `denominator` columns for finding the areas where there are clusters and outliers for the resulting rate of those two values.
#### Arguments
| Name | Type | Description |
|------|------|-------------|
| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must have the geometry column name `the_geom` and id column name `cartodb_id` unless otherwise specified in the input arguments |
| numerator | TEXT | Name of the numerator for forming a rate to be used in analysis. |
| denominator | TEXT | Name of the denominator for forming a rate to be used in analysis. |
| weight type (optional) | TEXT | Type of weight to use when finding neighbors. Currently available options are 'knn' (default) and 'queen'. Read more about weight types in [PySAL's weights documentation](https://pysal.readthedocs.io/en/v1.11.0/users/tutorials/weights.html). |
| num_ngbrs (optional) | INT | Number of neighbors if using k-nearest neighbors weight type. Defaults to 5. |
| permutations (optional) | INT | Number of permutations to check against a random arrangement of the values in `column_name`. This influences the accuracy of the output field `significance`. Defaults to 99. |
| geom_col (optional) | TEXT | The column name for the geometries. Defaults to `the_geom` |
| id_col (optional) | TEXT | The column name for the unique ID of each geometry/value pair. Defaults to `cartodb_id`. |
#### Returns
A table with the following columns.
| Column Name | Type | Description |
|-------------|------|-------------|
| quads | TEXT | Classification of geometry. Result is one of 'HH' (a high value with neighbors high on average), 'LL' (opposite of 'HH'), 'HL' (a high value surrounded by lows on average), and 'LH' (opposite of 'HL'). Null values are returned when nulls exist in the original data. |
| significance | NUMERIC | The statistical significance (from 0 to 1) of a cluster or outlier classification. Lower numbers are more significant. |
| spatial\_lag | NUMERIC | The 'average' of the neighbors of the value in this row. The average is calculated from it's neighborhood -- defined by `weight_type`. |
| spatial\_lag\_std | NUMERIC | The standardized version of `spatial_lag` -- that is, centered on the mean and divided by the standard deviation. |
| orig\_val | NUMERIC | Standardized rate (centered on the mean and normalized by the standard deviation) calculated from `numerator` and `denominator`. This is calculated by [Assuncao Rate](http://pysal.readthedocs.io/en/latest/library/esda/smoothing.html?highlight=assuncao#pysal.esda.smoothing.assuncao_rate) in the PySAL library. |
| orig\_val\_std | NUMERIC | Values from `column_name` but centered on the mean and divided by the standard devation. Useful as the x-axis in Moran's I scatter plots. |
| moran\_stat | NUMERIC | Value of Moran's I (spatial autocorrelation measure) for the geometry with id of `rowid` |
| rowid | INT | Row id of the values which correspond to the input rows. |
A table with the following columns. |
#### Example Usage
```sql
SELECT
c.the_geom,
m.quads,
m.significance,
c.cyclists_per_total_population
FROM
cdb_crankshaft.CDB_MoransILocalRate(
'SELECT * FROM commute_data'
'num_cyclists',
'total_population') As m
JOIN commute_data As c
ON c.cartodb_id = m.rowid;
```
### CDB_AreasOfInterestLocal(subquery text, column_name text) (deprecated)
This function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. The classification happens through an autocorrelation statistic called Local Moran's I.
@ -55,7 +158,7 @@ JOIN commute_data As c
ON c.cartodb_id = aoi.rowid;
```
### CDB_AreasOfInterestGlobal(subquery text, column_name text)
### CDB_AreasOfInterestGlobal(subquery text, column_name text) (deprecated)
This function identifies the extent to which geometries cluster (the groupings of geometries with similarly high or low values relative to the mean) or form outliers (areas where geometries have values opposite of their neighbors). The output of this function gives values between -1 and 1 as well as a significance of that classification. Values close to 0 mean that there is little to no distribution of values as compared to what one would see in a randomly distributed collection of geometries and values.
@ -91,7 +194,7 @@ FROM
'num_cyclists_per_total_population')
```
### CDB_AreasOfInterestLocalRate(subquery text, numerator_column text, denominator_column text)
### CDB_AreasOfInterestLocalRate(subquery text, numerator_column text, denominator_column text) (deprecated)
Just like `CDB_AreasOfInterestLocal`, this function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. This function differs in that it calculates the classifications based on input `numerator` and `denominator` columns for finding the areas where there are clusters and outliers for the resulting rate of those two values.
@ -138,7 +241,7 @@ JOIN commute_data As c
ON c.cartodb_id = aoi.rowid;
```
### CDB_AreasOfInterestGlobalRate(subquery text, column_name text)
### CDB_AreasOfInterestGlobalRate(subquery text, column_name text) (deprecated)
This function identifies the extent to which geometries cluster (the groupings of geometries with similarly high or low values relative to the mean) or form outliers (areas where geometries have values opposite of their neighbors). The output of this function gives values between -1 and 1 as well as a significance of that classification. Values close to 0 mean that there is little to no distribution of values as compared to what one would see in a randomly distributed collection of geometries and values.
@ -178,7 +281,7 @@ FROM
## Hotspot, Coldspot, and Outlier Functions
These functions are convenience functions for extracting only information that you are interested in exposing based on the outputs of the `CDB_AreasOfInterest` functions. For instance, you can use `CDB_GetSpatialHotspots` to output only the classifications of `HH` and `HL`.
These functions are convenience functions for extracting only information that you are interested in exposing based on the outputs of the `CDB_MoransI*` functions. For instance, you can use `CDB_GetSpatialHotspots` to output only the classifications of `HH` and `HL`.
### Non-rate functions

View File

@ -17,7 +17,7 @@ AS $$
num_ngbrs, permutations, geom_col, id_col)
$$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
-- Moran's I Local (internal function)
-- Moran's I Local (internal function) - DEPRECATED
CREATE OR REPLACE FUNCTION
_CDB_AreasOfInterestLocal(
subquery TEXT,
@ -27,16 +27,82 @@ CREATE OR REPLACE FUNCTION
permutations INT,
geom_col TEXT,
id_col TEXT)
RETURNS TABLE (moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
RETURNS TABLE (
moran NUMERIC,
quads TEXT,
significance NUMERIC,
rowid INT,
vals NUMERIC)
AS $$
from crankshaft.clustering import Moran
moran = Moran()
# TODO: use named parameters or a dictionary
return moran.local_stat(subquery, column_name, w_type,
result = moran.local_stat(subquery, column_name, w_type,
num_ngbrs, permutations, geom_col, id_col)
# remove spatial lag
return [(r[6], r[0], r[1], r[7], r[5]) for r in result]
$$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
-- Moran's I Local (internal function)
CREATE OR REPLACE FUNCTION
_CDB_MoransILocal(
subquery TEXT,
column_name TEXT,
w_type TEXT,
num_ngbrs INT,
permutations INT,
geom_col TEXT,
id_col TEXT)
RETURNS TABLE (
quads TEXT,
significance NUMERIC,
spatial_lag NUMERIC,
spatial_lag_std NUMERIC,
orig_val NUMERIC,
orig_val_std NUMERIC,
moran_stat NUMERIC,
rowid INT)
AS $$
from crankshaft.clustering import Moran
moran = Moran()
return moran.local_stat(subquery, column_name, w_type,
num_ngbrs, permutations, geom_col, id_col)
$$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
-- Moran's I Local (public-facing function)
-- Replaces CDB_AreasOfInterestLocal
CREATE OR REPLACE FUNCTION
CDB_MoransILocal(
subquery TEXT,
column_name TEXT,
w_type TEXT DEFAULT 'knn',
num_ngbrs INT DEFAULT 5,
permutations INT DEFAULT 99,
geom_col TEXT DEFAULT 'the_geom',
id_col TEXT DEFAULT 'cartodb_id')
RETURNS TABLE (
quads TEXT,
significance NUMERIC,
spatial_lag NUMERIC,
spatial_lag_std NUMERIC,
orig_val NUMERIC,
orig_val_std NUMERIC,
moran_stat NUMERIC,
rowid INT)
AS $$
SELECT
quads, significance, spatial_lag, spatial_lag_std,
orig_val, orig_val_std, moran_stat, rowid
FROM cdb_crankshaft._CDB_MoransILocal(
subquery, column_name, w_type,
num_ngbrs, permutations, geom_col, id_col);
$$ LANGUAGE SQL VOLATILE PARALLEL UNSAFE;
-- Moran's I Local (public-facing function) - DEPRECATED
CREATE OR REPLACE FUNCTION
CDB_AreasOfInterestLocal(
subquery TEXT,
@ -132,7 +198,7 @@ AS $$
$$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
-- Moran's I Local Rate (internal function)
-- Moran's I Local Rate (internal function) - DEPRECATED
CREATE OR REPLACE FUNCTION
_CDB_AreasOfInterestLocalRate(
subquery TEXT,
@ -144,15 +210,22 @@ CREATE OR REPLACE FUNCTION
geom_col TEXT,
id_col TEXT)
RETURNS
TABLE(moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
TABLE(
moran NUMERIC,
quads TEXT,
significance NUMERIC,
rowid INT,
vals NUMERIC)
AS $$
from crankshaft.clustering import Moran
moran = Moran()
# TODO: use named parameters or a dictionary
return moran.local_rate_stat(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col)
result = moran.local_rate_stat(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col)
# remove spatial lag
return [(r[6], r[0], r[1], r[7], r[4]) for r in result]
$$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
-- Moran's I Local Rate (public-facing function)
-- Moran's I Local Rate (public-facing function) - DEPRECATED
CREATE OR REPLACE FUNCTION
CDB_AreasOfInterestLocalRate(
subquery TEXT,
@ -172,6 +245,75 @@ AS $$
$$ LANGUAGE SQL VOLATILE PARALLEL UNSAFE;
-- Internal function
CREATE OR REPLACE FUNCTION
_CDB_MoransILocalRate(
subquery TEXT,
numerator TEXT,
denominator TEXT,
w_type TEXT,
num_ngbrs INT,
permutations INT,
geom_col TEXT,
id_col TEXT)
RETURNS
TABLE(
quads TEXT,
significance NUMERIC,
spatial_lag NUMERIC,
spatial_lag_std NUMERIC,
orig_val NUMERIC,
orig_val_std NUMERIC,
moran_stat NUMERIC,
rowid INT)
AS $$
from crankshaft.clustering import Moran
moran = Moran()
return moran.local_rate_stat(
subquery,
numerator,
denominator,
w_type,
num_ngbrs,
permutations,
geom_col,
id_col
)
$$ LANGUAGE plpythonu VOLATILE PARALLEL UNSAFE;
-- Moran's I Rate
-- Replaces CDB_AreasOfInterestLocalRate
CREATE OR REPLACE FUNCTION
CDB_MoransILocalRate(
subquery TEXT,
numerator TEXT,
denominator TEXT,
w_type TEXT DEFAULT 'knn',
num_ngbrs INT DEFAULT 5,
permutations INT DEFAULT 99,
geom_col TEXT DEFAULT 'the_geom',
id_col TEXT DEFAULT 'cartodb_id')
RETURNS
TABLE(
quads TEXT,
significance NUMERIC,
spatial_lag NUMERIC,
spatial_lag_std NUMERIC,
orig_val NUMERIC,
orig_val_std NUMERIC,
moran_stat NUMERIC,
rowid INT)
AS $$
SELECT
quads, significance, spatial_lag, spatial_lag_std,
orig_val, orig_val_std, moran_stat, rowid
FROM cdb_crankshaft._CDB_MoransILocalRate(
subquery, numerator, denominator, w_type,
num_ngbrs, permutations, geom_col, id_col);
$$ LANGUAGE SQL VOLATILE PARALLEL UNSAFE;
-- Moran's I Local Rate only for HH and HL (public-facing function)
CREATE OR REPLACE FUNCTION
CDB_GetSpatialHotspotsRate(

View File

@ -68,6 +68,63 @@ code|quads
(52 rows)
_cdb_random_seeds
(1 row)
code|quads|diff_orig|expected|moran_stat_not_null|significance_not_null|value_comparison
01|HH|t|t|t|t|t
02|HL|t|t|t|t|t
03|LL|t|t|t|t|t
04|LL|t|t|t|t|t
05|LH|t|t|t|t|t
06|LL|t|t|t|t|t
07|HH|t|t|t|t|t
08|HH|t|t|t|t|t
09|HH|t|t|t|t|t
10|LL|t|t|t|t|t
11|LL|t|t|t|t|t
12|LL|t|t|t|t|t
13|HL|t|t|t|t|t
14|LL|t|t|t|t|t
15|LL|t|t|t|t|t
16|HH|t|t|t|t|t
17|HH|t|t|t|t|t
18|LL|t|t|t|t|t
19|HH|t|t|t|t|t
20|HH|t|t|t|t|t
21|LL|t|t|t|t|t
22|HH|t|t|t|t|t
23|LL|t|t|t|t|t
24|LL|t|t|t|t|t
25|HH|t|t|t|t|t
26|HH|t|t|t|t|t
27|LL|t|t|t|t|t
28|HH|t|t|t|t|t
29|LL|t|t|t|t|t
30|LL|t|t|t|t|t
31|HH|t|t|t|t|t
32|LL|t|t|t|t|t
33|HL|t|t|t|t|t
34|LH|t|t|t|t|t
35|LL|t|t|t|t|t
36|LL|t|t|t|t|t
37|HL|t|t|t|t|t
38|HL|t|t|t|t|t
39|HH|t|t|t|t|t
40|HH|t|t|t|t|t
41|HL|t|t|t|t|t
42|LH|t|t|t|t|t
43|LH|t|t|t|t|t
44|LL|t|t|t|t|t
45|LH|t|t|t|t|t
46|LL|t|t|t|t|t
47|LL|t|t|t|t|t
48|HH|t|t|t|t|t
49|LH|t|t|t|t|t
50|HH|t|t|t|t|t
51|LL|t|t|t|t|t
52|LL|t|t|t|t|t
(52 rows)
_cdb_random_seeds
(1 row)
code|quads
01|HH
@ -204,6 +261,63 @@ code|quads
(52 rows)
_cdb_random_seeds
(1 row)
code|quads|diff_orig|expected|moran_stat_not_null|significance_not_null
01|HH|t|t|t|t
02|HL|t|t|t|t
03|LL|t|t|t|t
04|LL|t|t|t|t
05|LH|t|t|t|t
06|LL|t|t|t|t
07|HH|t|t|t|t
08|HH|t|t|t|t
09|HH|t|t|t|t
10|LL|t|t|t|t
11|LL|t|t|t|t
12|LL|t|t|t|t
13|HL|t|t|t|t
14|LL|t|t|t|t
15|LL|t|t|t|t
16|HH|t|t|t|t
17|HH|t|t|t|t
18|LL|t|t|t|t
19|HH|t|t|t|t
20|HH|t|t|t|t
21|LL|t|t|t|t
22|HH|t|t|t|t
23|LL|t|t|t|t
24|LL|t|t|t|t
25|HH|t|t|t|t
26|HH|t|t|t|t
27|LL|t|t|t|t
28|HH|t|t|t|t
29|LL|t|t|t|t
30|LL|t|t|t|t
31|HH|t|t|t|t
32|LL|t|t|t|t
33|HL|t|t|t|t
34|LH|t|t|t|t
35|LL|t|t|t|t
36|LL|t|t|t|t
37|HL|t|t|t|t
38|HL|t|t|t|t
39|HH|t|t|t|t
40|HH|t|t|t|t
41|HL|t|t|t|t
42|LH|t|t|t|t
43|LH|t|t|t|t
44|LL|t|t|t|t
45|LH|t|t|t|t
46|LL|t|t|t|t
47|LL|t|t|t|t
48|HH|t|t|t|t
49|LH|t|t|t|t
50|HH|t|t|t|t
51|LL|t|t|t|t
52|LL|t|t|t|t
(52 rows)
_cdb_random_seeds
(1 row)
code|quads
01|HH

View File

@ -24,6 +24,25 @@ SELECT ppoints.code, m.quads
SELECT cdb_crankshaft._cdb_random_seeds(1234);
-- Moran's I local
SELECT
ppoints.code, m.quads,
abs(avg(m.orig_val_std) OVER ()) < 1e-6 as diff_orig,
CASE WHEN m.quads = 'HL' THEN m.orig_val_std > m.spatial_lag_std
WHEN m.quads = 'HH' THEN m.orig_val_std >= 0 and m.spatial_lag_std >= 0
WHEN m.quads = 'LH' THEN m.orig_val_std < m.spatial_lag_std
WHEN m.quads = 'LL' THEN m.orig_val_std <= 0 and m.spatial_lag_std <= 0
ELSE null END as expected,
moran_stat is not null moran_stat_not_null,
significance >= 0.001 significance_not_null, -- greater than 1/1000 (default)
abs(m.orig_val - ppoints.value) <= 1e-6 as value_comparison
FROM ppoints
JOIN cdb_crankshaft.CDB_MoransILocal('SELECT * FROM ppoints', 'value') m
ON ppoints.cartodb_id = m.rowid
ORDER BY ppoints.code;
SELECT cdb_crankshaft._cdb_random_seeds(1234);
-- Spatial Hotspots
SELECT ppoints.code, m.quads
FROM ppoints
@ -61,6 +80,24 @@ SELECT ppoints2.code, m.quads
SELECT cdb_crankshaft._cdb_random_seeds(1234);
-- Moran's I local rate
SELECT
ppoints2.code, m.quads,
abs(avg(m.orig_val_std) OVER ()) < 1e-6 as diff_orig,
CASE WHEN m.quads = 'HL' THEN m.orig_val_std > m.spatial_lag_std
WHEN m.quads = 'HH' THEN m.orig_val_std >= 0 and m.spatial_lag_std >= 0
WHEN m.quads = 'LH' THEN m.orig_val_std < m.spatial_lag_std
WHEN m.quads = 'LL' THEN m.orig_val_std <= 0 and m.spatial_lag_std <= 0
ELSE null END as expected,
moran_stat is not null moran_stat_not_null,
significance >= 0.001 significance_not_null -- greater than 1/1000 (default)
FROM ppoints2
JOIN cdb_crankshaft.CDB_MoransILocalRate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
ON ppoints2.cartodb_id = m.rowid
ORDER BY ppoints2.code;
SELECT cdb_crankshaft._cdb_random_seeds(1234);
-- Spatial Hotspots (rate)
SELECT ppoints2.code, m.quads
FROM ppoints2

View File

@ -1,21 +1,29 @@
"""
Moran's I geostatistics (global clustering & outliers presence)
Functionality relies on a combination of `PySAL
<http://pysal.readthedocs.io/en/latest/>`__ and the data providered provided in
the class instantiation (which defaults to PostgreSQL's plpy module's `database
access functions <https://www.postgresql.org/docs/10/static/plpython.html>`__).
"""
# TODO: Fill in local neighbors which have null/NoneType values with the
# average of the their neighborhood
import pysal as ps
from collections import OrderedDict
from crankshaft.analysis_data_provider import AnalysisDataProvider
import pysal as ps
# crankshaft module
import crankshaft.pysal_utils as pu
from crankshaft.analysis_data_provider import AnalysisDataProvider
# High level interface ---------------------------------------
class Moran(object):
"""Class for calculation of Moran's I statistics (global, local, and local
rate)
Parameters:
data_provider (:obj:`AnalysisDataProvider`): Class for fetching data. See
the `crankshaft.analysis_data_provider` module for more information.
"""
def __init__(self, data_provider=None):
if data_provider is None:
self.data_provider = AnalysisDataProvider()
@ -28,7 +36,26 @@ class Moran(object):
Moran's I (global)
Implementation building neighbors with a PostGIS database and Moran's I
core clusters with PySAL.
Andy Eschbacher
Args:
subquery (str): Query to give access to the data needed. This query
must give access to ``attr_name``, ``geom_col``, and ``id_col``.
attr_name (str): Column name of data to analyze
w_type (str): Type of spatial weight. Must be one of `knn`
or `queen`. See `PySAL documentation
<http://pysal.readthedocs.io/en/latest/users/tutorials/weights.html>`__
for more information.
num_ngbrs (int): If using `knn` for ``w_type``, this
specifies the number of neighbors to be used to define the spatial
neighborhoods.
permutations (int): Number of permutations for performing
conditional randomization to find the p-value. Higher numbers
takes a longer time for getting results.
geom_col (str): Name of the geometry column in the dataset for
finding the spatial neighborhoods.
id_col (str): Row index for each value. Usually the database index.
"""
params = OrderedDict([("id_col", id_col),
("attr1", attr_name),
@ -53,8 +80,38 @@ class Moran(object):
def local_stat(self, subquery, attr,
w_type, num_ngbrs, permutations, geom_col, id_col):
"""
Moran's I implementation for PL/Python
Andy Eschbacher
Moran's I (local)
Args:
subquery (str): Query to give access to the data needed. This query
must give access to ``attr_name``, ``geom_col``, and ``id_col``.
attr (str): Column name of data to analyze
w_type (str): Type of spatial weight. Must be one of `knn`
or `queen`. See `PySAL documentation
<http://pysal.readthedocs.io/en/latest/users/tutorials/weights.html>`__
for more information.
num_ngbrs (int): If using `knn` for ``w_type``, this
specifies the number of neighbors to be used to define the spatial
neighborhoods.
permutations (int): Number of permutations for performing
conditional randomization to find the p-value. Higher numbers
takes a longer time for getting results.
geom_col (str): Name of the geometry column in the dataset for
finding the spatial neighborhoods.
id_col (str): Row index for each value. Usually the database index.
Returns:
list of tuples: Where each tuple consists of the following values:
- quadrants classification (one of `HH`, `HL`, `LL`, or `LH`)
- p-value
- spatial lag
- standardized spatial lag (centered on the mean, normalized by the
standard deviation)
- original value
- standardized value
- Moran's I statistic
- original row index
"""
# geometries with attributes that are null are ignored
@ -78,13 +135,45 @@ class Moran(object):
# find quadrants for each geometry
quads = quad_position(lisa.q)
return zip(lisa.Is, quads, lisa.p_sim, weight.id_order, lisa.y)
# calculate spatial lag
lag = ps.weights.spatial_lag.lag_spatial(weight, lisa.y)
lag_std = ps.weights.spatial_lag.lag_spatial(weight, lisa.z)
return zip(
quads,
lisa.p_sim,
lag,
lag_std,
lisa.y,
lisa.z,
lisa.Is,
weight.id_order
)
def global_rate_stat(self, subquery, numerator, denominator,
w_type, num_ngbrs, permutations, geom_col, id_col):
"""
Moran's I Rate (global)
Andy Eschbacher
Args:
subquery (str): Query to give access to the data needed. This query
must give access to ``attr_name``, ``geom_col``, and ``id_col``.
numerator (str): Column name of numerator to analyze
denominator (str): Column name of the denominator
w_type (str): Type of spatial weight. Must be one of `knn`
or `queen`. See `PySAL documentation
<http://pysal.readthedocs.io/en/latest/users/tutorials/weights.html>`__
for more information.
num_ngbrs (int): If using `knn` for ``w_type``, this
specifies the number of neighbors to be used to define the spatial
neighborhoods.
permutations (int): Number of permutations for performing
conditional randomization to find the p-value. Higher numbers
takes a longer time for getting results.
geom_col (str): Name of the geometry column in the dataset for
finding the spatial neighborhoods.
id_col (str): Row index for each value. Usually the database index.
"""
params = OrderedDict([("id_col", id_col),
("attr1", numerator),
@ -111,7 +200,38 @@ class Moran(object):
w_type, num_ngbrs, permutations, geom_col, id_col):
"""
Moran's I Local Rate
Andy Eschbacher
Args:
subquery (str): Query to give access to the data needed. This query
must give access to ``attr_name``, ``geom_col``, and ``id_col``.
numerator (str): Column name of numerator to analyze
denominator (str): Column name of the denominator
w_type (str): Type of spatial weight. Must be one of `knn`
or `queen`. See `PySAL documentation
<http://pysal.readthedocs.io/en/latest/users/tutorials/weights.html>`__
for more information.
num_ngbrs (int): If using `knn` for ``w_type``, this
specifies the number of neighbors to be used to define the spatial
neighborhoods.
permutations (int): Number of permutations for performing
conditional randomization to find the p-value. Higher numbers
takes a longer time for getting results.
geom_col (str): Name of the geometry column in the dataset for
finding the spatial neighborhoods.
id_col (str): Row index for each value. Usually the database index.
Returns:
list of tuples: Where each tuple consists of the following values:
- quadrants classification (one of `HH`, `HL`, `LL`, or `LH`)
- p-value
- spatial lag
- standardized spatial lag (centered on the mean, normalized by the
standard deviation)
- original value (roughly numerator divided by denominator)
- standardized value
- Moran's I statistic
- original row index
"""
# geometries with values that are null are ignored
# resulting in a collection of not as near neighbors
@ -138,7 +258,20 @@ class Moran(object):
# find quadrants for each geometry
quads = quad_position(lisa.q)
return zip(lisa.Is, quads, lisa.p_sim, weight.id_order, lisa.y)
# spatial lag
lag = ps.weights.spatial_lag.lag_spatial(weight, lisa.y)
lag_std = ps.weights.spatial_lag.lag_spatial(weight, lisa.z)
return zip(
quads,
lisa.p_sim,
lag,
lag_std,
lisa.y,
lisa.z,
lisa.Is,
weight.id_order
)
def local_bivariate_stat(self, subquery, attr1, attr2,
permutations, geom_col, id_col,
@ -179,9 +312,9 @@ def map_quads(coord):
"""
Map a quadrant number to Moran's I designation
HH=1, LH=2, LL=3, HL=4
Input:
@param coord (int): quadrant of a specific measurement
Output:
Args:
coord (int): quadrant of a specific measurement
Returns:
classification (one of 'HH', 'LH', 'LL', or 'HL')
"""
if coord == 1:
@ -192,17 +325,17 @@ def map_quads(coord):
return 'LL'
elif coord == 4:
return 'HL'
else:
return None
def quad_position(quads):
"""
Produce Moran's I classification based of n
Input:
@param quads ndarray: an array of quads classified by
Map all quads
Args:
quads (:obj:`numpy.ndarray`): an array of quads classified by
1-4 (PySAL default)
Output:
@param list: an array of quads classied by 'HH', 'LL', etc.
Returns:
list: an array of quads classied by 'HH', 'LL', etc.
"""
return [map_quads(q) for q in quads]

View File

@ -71,10 +71,10 @@ class MoranTest(unittest.TestCase):
random_seeds.set_random_seeds(1234)
result = moran.local_stat('subquery', 'value',
'knn', 5, 99, 'the_geom', 'cartodb_id')
result = [(row[0], row[1]) for row in result]
result = [(row[0], row[6]) for row in result]
zipped_values = zip(result, self.moran_data)
for ([res_val, res_quad], [exp_val, exp_quad]) in zipped_values:
for ([res_quad, res_val], [exp_val, exp_quad]) in zipped_values:
self.assertAlmostEqual(res_val, exp_val)
self.assertEqual(res_quad, exp_quad)
@ -89,11 +89,11 @@ class MoranTest(unittest.TestCase):
moran = Moran(FakeDataProvider(data))
result = moran.local_rate_stat('subquery', 'numerator', 'denominator',
'knn', 5, 99, 'the_geom', 'cartodb_id')
result = [(row[0], row[1]) for row in result]
result = [(row[0], row[6]) for row in result]
zipped_values = zip(result, self.moran_data)
for ([res_val, res_quad], [exp_val, exp_quad]) in zipped_values:
for ([res_quad, res_val], [exp_val, exp_quad]) in zipped_values:
self.assertAlmostEqual(res_val, exp_val)
def test_moran(self):