diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 94e63bc..ebf2155 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -32,6 +32,9 @@ follow the [Semantic Versioning 2.0](http://semver.org/) guidelines:
   - Add new files or modify copies of the old files to add new functions or modify
     existing functions (remember to rename a function if the signature changes)
+  - Add or modify the corresponding documentation files in the `doc` folder.
+    Since we expect to have highly technical functions here, an extensive
+    background explanation would be of great help to users of this extension.
   - Create tests for the new functions/behaviour
 * Generate the **upgrade and downgrade files** for the extension
diff --git a/pg/doc/02_moran.md b/pg/doc/02_moran.md
new file mode 100644
index 0000000..85384cb
--- /dev/null
+++ b/pg/doc/02_moran.md
@@ -0,0 +1,71 @@
+### Moran's I
+
+#### What is Moran's I and why is it significant for CartoDB?
+
+Moran's I is a geostatistical calculation which gives a measure of the global
+clustering and presence of outliers within the geographies in a map. Here global
+means over all of the geographies in a dataset. Imagine mapping the incidence
+rates of cancer in neighborhoods of a city. If there were areas covering several
+neighborhoods with abnormally low rates of cancer, those areas would be
+positively spatially correlated with one another and would be considered a
+cluster. If there were a single neighborhood with a high rate but with all
+neighbors on average having a low rate, it would be considered a spatial outlier.
+
+While Moran's I gives a global snapshot, there are local indicators for
+clustering called Local Indicators of Spatial Autocorrelation (LISA). Clustering
+is a process related to autocorrelation -- i.e., a process that compares a
+geography's attribute to the attribute in neighbor geographies.
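To make the idea of a global measure concrete, here is a minimal sketch of the textbook Moran's I formula in plain Python. It assumes a dense binary adjacency matrix as the spatial weights; the function name `morans_i` and the sample data are hypothetical illustrations, not part of this extension or its actual implementation.

```python
def morans_i(values, weights):
    """Global Moran's I for `values`, given a dense spatial weights matrix.

    weights[i][j] is the weight between geography i and geography j
    (here assumed to be 1 for neighbors and 0 otherwise).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]          # deviations from the mean
    s0 = sum(sum(row) for row in weights)     # sum of all weights
    # Cross-products of deviations for every weighted pair (i, j).
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)             # total squared deviation
    return (n / s0) * num / den


# Hypothetical data: four neighborhoods in a row, each adjacent to the next.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1.0, 1.0, 5.0, 5.0], chain)    # similar values adjacent
alternating = morans_i([1.0, 5.0, 1.0, 5.0], chain)  # dissimilar values adjacent
```

In this toy example the clustered arrangement gives a positive value and the alternating arrangement gives a negative one, which is the sign convention the text relies on when it speaks of correlated and anticorrelated geographies.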
+
+In the cancer-rate example, neighborhoods that have a high rate of cancer and
+whose neighbors also have high rates are designated as "High High" or simply
+**HH**. Areas made up of multiple neighborhoods with low rates of cancer are
+designated as "Low Low" or **LL**. HH and LL fit naturally into the concept of
+clustering: they are the positively correlated cases.
+
+"Anticorrelated" geographies are in **LH** and **HL** regions -- that is,
+regions where a geography has a high value and its neighbors, on average, have a
+low value (or vice versa). An example of this is a "gated community" or the
+placement of a city housing project in a rich region. These deliberate
+developments have a median income opposite to that of the neighbors around
+them. They have a high (or low) value while their neighbors have a low (or high)
+value. They typically exist as islands, and in rare circumstances can extend as
+chains dividing **LL** or **HH** regions.
+
+Strong policies such as rent stabilization (probably) tend to prevent the
+clustering of high rent areas as they integrate middle class incomes. Luxury
+apartment buildings, which are a kind of gated community, probably tend to skew
+an area's median income upwards while housing projects have the opposite effect.
+What are the nuggets in the analysis?
+
+Two functions are available to compute Moran's I statistics:
+
+* `cdb_moran_local` computes Moran's I measures, quad classification and
+  significance values from numerical values associated with geometry entities
+  in an input table. The geometries should be contiguous polygons when
+  the `queen` `w_type` is used.
+* `cdb_moran_local_rate` computes the same statistics using a ratio between
+  numerator and denominator columns of a table.
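The HH, LL, HL and LH labels described earlier in this document can be illustrated with a small sketch. It compares each geography's value, and the average value of its neighbors, against the overall mean. The function `quad_labels`, the neighbor lists and the rates are hypothetical; the sketch also skips the permutation-based significance test, so it never produces a "Not significant" label and should not be read as the extension's actual algorithm.

```python
def quad_labels(values, neighbors):
    """Label each geography by (own value, neighbors' average) vs. the mean.

    The first letter describes the geography itself, the second its
    neighbors: 'H' if at or above the overall mean, 'L' if below it.
    """
    mean = sum(values) / len(values)
    labels = []
    for i, value in enumerate(values):
        # Average value over the neighbors of geography i (its "spatial lag").
        lag = sum(values[j] for j in neighbors[i]) / len(neighbors[i])
        own = "H" if value >= mean else "L"
        around = "H" if lag >= mean else "L"
        labels.append(own + around)
    return labels


# Hypothetical rates: a high cluster (0-2), a low cluster (3-5) and one
# high-valued geography (6) surrounded by the low cluster.
rates = [8.0, 9.0, 8.0, 1.0, 2.0, 1.0, 9.0]
neighbors = {
    0: [1, 2], 1: [0, 2], 2: [0, 1],
    3: [4, 5, 6], 4: [3, 5, 6], 5: [3, 4, 6],
    6: [3, 4, 5],
}

labels = quad_labels(rates, neighbors)
```

Geography 6 plays the role of the "island" described in the text: it has a high value while its neighbors are low on average, so it is labelled HL.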
+
+The parameters for `cdb_moran_local` are:
+
+* `table` name of the table that contains the data values
+* `attr` name of the column that contains the values to analyze
+* `significance` significance threshold for the quads values
+* `num_ngbrs` number of neighbors to consider (default: 5)
+* `permutations` number of random permutations for calculation of
+  pseudo-p values (default: 99)
+* `geom_column` name of the geometry column (default: "the_geom")
+* `id_col` PK column of the table (default: "cartodb_id")
+* `w_type` weight type: can be "knn" for k-nearest neighbor weights
+  or "queen" for contiguity based weights.
+
+The function returns a table with the following columns:
+
+* `moran` Moran's value
+* `quads` quad classification ('HH', 'LL', 'HL', 'LH' or 'Not significant')
+* `significance` significance value
+* `ids` id of the corresponding record in the input table
+
+Function `cdb_moran_local_rate` only differs in that the `attr` input
+parameter is substituted by `numerator` and `denominator`.
diff --git a/pg/doc/03_overlap_sum.md b/pg/doc/03_overlap_sum.md
new file mode 100644
index 0000000..b3f797e
--- /dev/null
+++ b/pg/doc/03_overlap_sum.md
@@ -0,0 +1,29 @@
+### Areal Weighting
+
+Areal weighting is a simple interpolation technique to assign a value
+to a polygon given a set of polygons with one value assigned to each one.
+
+The value is assigned by averaging the values of intersecting areas
+weighted by the intersection area.
+
+Its accuracy depends on the values assigned to reference areas being
+homogeneous over each area.
+
+The `cdb_overlap_function` takes three required parameters:
+
+* `geometry` a Polygon geometry which defines the area where a value will be
+  estimated.
+* `table_name` name of the values table that provides the source values;
+  this table must have a geometric column `the_geom` containing the polygons
+  to which values are assigned.
+* `column_name` name of the column that contains the values in the values
+  table (should be a numeric column)
+
+There's also an additional optional parameter to define the schema to which
+the values table belongs. This is necessary only if it is not in the
+`search_path`. Note that `table_name` should never include the schema in it.
+
+* `schema_name` name of the schema that contains the values table
+
+This function returns a numeric value resulting from the aggregation
+of the polygons in