crankshaft/02_moran.md at similarity_rank

Andy Eschbacher 3f20275d3d adopting new format (wip)

2016-03-23 17:09:52 -04:00

7.3 KiB

Raw Permalink Blame History

Name

CDB_AreasOfInterest -- returns a table with a cluster/outlier classification, the significance of a classification, an autocorrelation statistic (Local Moran's I), and the geometry id for each geometry in the original dataset.

Synopsis

table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name)

table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name, int permutations, text geom_column, text id_column, text weight_type, int num_ngbrs)

Description

CDB_AreasOfInterest is a table-returning function that classifies the geometries in a table by an attribute and gives a significance for that classification. This information can be used to find "Areas of Interest" by using the correlation of a geometry's attribute with that of its neighbors. Areas can be clusters, outliers, or neither (depending on which significance value is used).

Inputs:

query (required): an arbitrary query against tables you have access to (e.g., in your account, shared in your organization, or through the Data Observatory). This string must contain the following columns: an id INT (e.g., cartodb_id), geometry (e.g., the_geom), and the numeric attribute which is specified in column_name
column_name (required): column to perform the area of interest analysis tool on. The data must be numeric (e.g., float, int, etc.)
permutations (optional): used to calculate the significance of a classification. Defaults to 99, which is sufficient in most situations.
geom_column (optional): the name of the geometry column. Data must be of type geometry.
id_column (optional): the name of the id column (e.g., cartodb_id). Data must be of type int or bigint and have a unique condition on the data.
weight_type (optional): the type of weight used for determining what defines a neighborhood. Options are knn or queen.
num_ngbrs (optional): the number of neighbors in a neighborhood around a geometry. Only used if knn is chosen above.

Outputs:

moran_val: underlying correlation statistic used in analysis
quadrant: human-readable interpretation of classification
significance: significance of classification (closer to 0 is more significant)
ids: id of original geometry (used for joining against original table if desired -- see examples)
column_values: original column values from column_name

Availability: crankshaft v0.0.1 and above

Examples

SELECT
  t.the_geom_webmercator,
  t.cartodb_id,
  aoi.significance,
  aoi.quadrant As aoi_quadrant
FROM
  observatory.acs2013 As t
JOIN
  crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013',
                                 'gini_index')

API Usage

Example

http://eschbacher.cartodb.com/api/v2/sql?q=SELECT * FROM crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013','gini_index')

Result

{
  time: 0.120,
  total_rows: 100,
  rows: [{
    moran_vals: 0.7213,
    quadrant: 'High area',
    significance: 0.03,
    ids: 1,
    column_value: 0.22
  },
  {
    moran_vals: -0.7213,
    quadrant: 'Low outlier',
    significance: 0.13,
    ids: 2,
    column_value: 0.03
  },
  ...
  ]
}

Moran's I is a geostatistical calculation which gives a measure of the global clustering and presence of outliers within the geographies in a map. Here global means over all of the geographies in a dataset. Imagine mapping the incidence rates of cancer in neighborhoods of a city. If there were areas covering several neighborhoods with abnormally low rates of cancer, those areas are positively spatially correlated with one another and would be considered a cluster. If there was a single neighborhood with a high rate but with all neighbors on average having a low rate, it would be considered a spatial outlier.

While Moran's I gives a global snapshot, there are local indicators for clustering called Local Indicators of Spatial Autocorrelation. Clustering is a process related to autocorrelation -- i.e., a process that compares a geography's attribute to the attribute in neighbor geographies.

For the example of cancer rates in neighborhoods, since these neighborhoods have a high value for rate of cancer, and all of their neighbors do as well, they are designated as "High High" or simply HH. For areas with multiple neighborhoods with low rates of cancer, they are designated as "Low Low" or LL. HH and LL naturally fit into the concept of clustering and are in the correlated variables.

"Anticorrelated" geogs are in LH and HL regions -- that is, regions where a geog has a high value and it's neighbors, on average, have a low value (or vice versa). An example of this is a "gated community" or placement of a city housing project in a rich region. These deliberate developments have opposite median income as compared to the neighbors around them. They have a high (or low) value while their neighbors have a low (or high) value. They exist typically as islands, and in rare circumstances can extend as chains dividing LL or HH.

Strong policies such as rent stabilization (probably) tend to prevent the clustering of high rent areas as they integrate middle class incomes. Luxury apartment buildings, which are a kind of gated community, probably tend to skew an area's median income upwards while housing projects have the opposite effect. What are the nuggets in the analysis?

Two functions are available to compute Moran I statistics:

cdb_moran_local computes Moran I measures, quad classification and significance values from numerial values associated to geometry entities in an input table. The geometries should be contiguous polygons When then queen w_type is used.
cdb_moran_local_rate computes the same statistics using a ratio between numerator and denominator columns of a table.

The parameters for cdb_moran_local are:

table name of the table that contains the data values
attr name of the column
signficance significance threshold for the quads values
num_ngbrs number of neighbors to consider (default: 5)
permutations number of random permutations for calculation of pseudo-p values (default: 99)
geom_column number of the geometry column (default: "the_geom")
id_col PK column of the table (default: "cartodb_id")
w_type Weight types: can be "knn" for k-nearest neighbor weights or "queen" for contiguity based weights.

The function returns a table with the following columns:

moran Moran's value
quads quad classification ('HH', 'LL', 'HL', 'LH' or 'Not significant')
significance significance value
ids id of the corresponding record in the input table

Function cdb_moran_local_rate only differs in that the attr input parameter is substituted by numerator and denominator.

7.3 KiB

Raw Permalink Blame History

Name

Synopsis

Description

Examples

API Usage

See Also

What is Moran's I and why is it significant for CartoDB?

7.3 KiB Raw Permalink Blame History

Name

Synopsis

Description

Examples

API Usage

See Also

What is Moran's I and why is it significant for CartoDB?

7.3 KiB

Raw Permalink Blame History