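Aggregation is a core database feature: it computes summary values over the data set returned by a query, such as the maximum or minimum of a field (or of a computed expression), a sum, or an average. Elasticsearch (ES), being both a search engine and a data store, also provides powerful aggregation capabilities. Its aggregations fall into the following categories:
- Metric aggregations: compute a metric such as the maximum, minimum, sum, or average over a set of documents.
- Bucket aggregations: besides aggregate functions, relational databases can group query results with GROUP BY and then compute metrics per group. In ES, this grouping is called bucketing.
- Matrix and pipeline aggregations: also available in ES, but still being refined at the time of writing.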
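Aggregations are defined in the search request body under the aggregations node (which can be abbreviated to aggs), using the following syntax:
"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : { [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}
The values an aggregation operates on can come from a field or from the result of a script.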
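As a rough illustration of this nesting, the sketch below builds a request body as a plain Python dictionary and prints it as JSON. The helper make_agg and the aggregation names are hypothetical and exist only for this example; no Elasticsearch client is involved.

```python
import json


def make_agg(agg_type, body, sub_aggs=None):
    """Build one aggregation entry: {<type>: <body>} plus optional sub-aggregations."""
    entry = {agg_type: body}
    if sub_aggs:
        entry["aggs"] = sub_aggs
    return entry


# A bucket aggregation (terms) with a metric aggregation (max) nested inside it.
request_body = {
    "size": 0,
    "aggs": {
        "age_terms": make_agg(
            "terms",
            {"field": "age"},
            sub_aggs={"max_balance": make_agg("max", {"field": "balance"})},
        )
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON could be pasted as the body of a _search request; the sub-aggregation sits beside the aggregation type, not inside its body.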
Find the maximum balance across all customers (size=0 suppresses the search hits, so only the aggregation result is returned):
POST /bank/_search
{
  "size": 0,
  "aggs": {
    "max_balance": {
      "max": { "field": "balance" }
    }
  }
}
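Find the maximum balance among customers aged 24. The query decides which documents are aggregated; size and sort only affect the hits that are returned:
POST /bank/_search
{
  "size": 2,
  "query": { "match": { "age": 24 } },
  "sort": [ { "balance": { "order": "desc" } } ],
  "aggs": {
    "max_balance": {
      "max": { "field": "balance" }
    }
  }
}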
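The relationship between query, size, sort, and the aggregation can be mimicked in a few lines of Python. This is only a sketch over made-up in-memory accounts, not how ES executes the request.

```python
# Made-up sample data standing in for the bank index.
accounts = [
    {"age": 24, "balance": 1200},
    {"age": 24, "balance": 4800},
    {"age": 24, "balance": 3100},
    {"age": 31, "balance": 9900},
]

# "query": only documents with age == 24 take part in hits and aggregations.
matched = [a for a in accounts if a["age"] == 24]

# "sort" + "size": the hits are the top 2 matched documents by balance.
hits = sorted(matched, key=lambda a: a["balance"], reverse=True)[:2]

# "aggs": the max is computed over all matched documents, not just the hits.
max_balance = max(a["balance"] for a in matched)

print(len(hits), max_balance)
```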
Compute the average age of all customers, taking the values from a script:
POST /bank/_search?size=0
{
  "aggs": {
    "avg_age": {
      "avg": {
        "script": { "source": "doc.age.value" }
      }
    },
    "avg_age10": {
      "avg": {
        "script": { "source": "doc.age.value + 10" }
      }
    }
  }
}
Specify a field, then use _value in the script to refer to that field's value:
POST /bank/_search?size=0
{
  "aggs": {
    "sum_balance": {
      "sum": {
        "field": "balance",
        "script": { "source": "_value * 1.03" }
      }
    }
  }
}
Supply a value for documents that lack the field. If missing is not set, those documents are ignored:
POST /bank/_search?size=0
{
  "aggs": {
    "avg_age": {
      "avg": {
        "field": "age",
        "missing": 18
      }
    }
  }
}
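The effect of missing on an average is easy to see on a tiny example. The function avg_age and the documents below are hypothetical; the sketch only mirrors the rule stated above.

```python
def avg_age(docs, missing=None):
    """Average the age field; documents without it are skipped unless `missing` is given."""
    values = []
    for doc in docs:
        if "age" in doc:
            values.append(doc["age"])
        elif missing is not None:
            values.append(missing)  # substitute the default for the absent field
    return sum(values) / len(values) if values else None


docs = [{"age": 24}, {"name": "no age field"}, {"age": 30}]

print(avg_age(docs))              # 27.0: the document without age is ignored
print(avg_age(docs, missing=18))  # 24.0: it is counted as if age were 18
```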
Document count (the _count API):
POST /bank/_doc/_count
{
  "query": {
    "match": { "age": 24 }
  }
}
Distinct-value count (cardinality):
POST /bank/_search?size=0
{
  "aggs": {
    "age_count": {
      "cardinality": { "field": "age" }
    },
    "state_count": {
      "cardinality": { "field": "state.keyword" }
    }
  }
}
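Count the values of a field across the matched documents (value_count); documents without the field contribute nothing:
POST /bank/_search?size=0
{
  "aggs": {
    "age_count": {
      "value_count": { "field": "age" }
    }
  }
}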
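The three kinds of count above differ only in what they count. The sketch below computes exact numbers on made-up documents; note that the ES cardinality aggregation is approximate (it is based on HyperLogLog++), so on large data its result can deviate slightly from an exact distinct count.

```python
# Made-up documents; the last one has no age field.
docs = [{"age": 24}, {"age": 24}, {"age": 31}, {"name": "no age field"}]

ages = [d["age"] for d in docs if "age" in d]

doc_count = len(docs)         # document count: every matched document
value_count = len(ages)       # value_count: values present for the field
cardinality = len(set(ages))  # cardinality: distinct values (exact here)

print(doc_count, value_count, cardinality)
```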
stats returns five values at once: count, max, min, avg, and sum:
POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": {
      "stats": { "field": "age" }
    }
  }
}
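extended_stats adds four more results to those of stats: the sum of squares, the variance, the standard deviation, and the interval of the mean plus or minus two standard deviations:
POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": {
      "extended_stats": { "field": "age" }
    }
  }
}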
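The four extra results can be reproduced by hand. The sketch below assumes the population variance (dividing by the number of values) and a default of two standard deviations for the bounds; the function name and the sample values are made up for illustration.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute the extra statistics, using the population variance (divide by n)."""
    n = len(values)
    avg = sum(values) / n
    variance = sum((v - avg) ** 2 for v in values) / n
    std = math.sqrt(variance)
    return {
        "avg": avg,
        "sum_of_squares": sum(v * v for v in values),
        "variance": variance,
        "std_deviation": std,
        "std_deviation_bounds": {
            "upper": avg + sigma * std,
            "lower": avg - sigma * std,
        },
    }


stats = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(stats)
```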
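The percentiles aggregation sorts the values of a field (or script) in ascending order, accumulates the share of matched documents covered up to each value, and returns the value at each requested percentile. By default it returns the values at percentiles [ 1, 5, 25, 50, 75, 95, 99 ]. In the response below, the middle entry reads: 50% of the matched documents have age <= 31.
POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": {
      "percentiles": { "field": "age" }
    }
  }
}
Response:
"aggregations": {
  "age_percents": {
    "values": {
      "1.0": 20,
      "5.0": 21,
      "25.0": 25,
      "50.0": 31,
      "75.0": 35,
      "95.0": 39,
      "99.0": 40
    }
  }
}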
The percentiles to return can also be specified:
POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": {
      "percentiles": {
        "field": "age",
        "percents": [95, 99, 99.9]
      }
    }
  }
}
Response:
"aggregations": {
  "age_percents": {
    "values": {
      "95.0": 39,
      "99.0": 40,
      "99.9": 40
    }
  }
}
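percentile_ranks works in the opposite direction: for each given value, it returns the percentage of matched documents whose field value is at or below it:
POST /bank/_search?size=0
{
  "aggs": {
    "age_perc_rank": {
      "percentile_ranks": {
        "field": "age",
        "values": [25, 30]
      }
    }
  }
}
Response:
"aggregations": {
  "age_perc_rank": {
    "values": {
      "25.0": 26.1,
      "30.0": 49.3
    }
  }
}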
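Percentiles and percentile ranks are inverse lookups on the same sorted data. The sketch below uses the exact nearest-rank definition on a made-up list of ages. ES computes percentiles approximately (with the TDigest algorithm by default) and may interpolate, so its numbers can differ from this exact calculation.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of values <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100 * sum(1 for v in values if v <= x) / len(values)


ages = [20, 22, 25, 25, 28, 31, 33, 35, 39, 40]

print(percentile(ages, 50), percentile(ages, 95))
print(percentile_rank(ages, 25), percentile_rank(ages, 30))
```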
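The terms aggregation groups documents by the distinct values of a field, here age:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": { "field": "age" }
    }
  }
}
Response:
"aggregations": {
  "age_terms": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 463,
    "buckets": [
      { "key": 31, "doc_count": 61 },
      { "key": 39, "doc_count": 60 },
      { "key": 26, "doc_count": 59 },
      ...
    ]
  }
}
In the response, doc_count_error_upper_bound is the maximum possible error in the document counts, sum_other_doc_count is the number of documents that fall into terms not returned, key is the age value, and doc_count is the number of documents with that value.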
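By default, the top 10 buckets are returned, ordered by document count from high to low.
size sets how many buckets are returned.
shard_size sets how many buckets each shard returns. Its default is:
- If the index has a single shard, shard_size = size
- If the index has multiple shards, shard_size = size * 1.5 + 10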
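The default described by the two rules above can be written as a small function. The name default_shard_size is hypothetical, and truncating the result to an integer is an assumption of this sketch, not something the text states.

```python
def default_shard_size(size, num_shards):
    """Default shard_size per the two rules above (integer truncation is assumed)."""
    if num_shards == 1:
        return size
    return int(size * 1.5 + 10)


print(default_shard_size(10, 1))  # single shard: same as size
print(default_shard_size(10, 5))  # several shards: 10 * 1.5 + 10
```

Asking each shard for more buckets than the final size makes it less likely that a term is dropped on one shard even though it ranks highly overall.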
show_term_doc_count_error controls whether the error bound is shown for each bucket:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "size": 5,
        "shard_size": 20,
        "show_term_doc_count_error": true
      }
    }
  }
}
order sorts the buckets by document count or by key.
Order by document count:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": { "_count": "asc" }
      }
    }
  }
}
Order by key:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": { "_key": "asc" }
      }
    }
  }
}
Compute metrics per bucket. For example, group by age and show the minimum and maximum balance for each age:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": { "max_balance": "asc" }
      },
      "aggs": {
        "max_balance": { "max": { "field": "balance" } },
        "min_balance": { "min": { "field": "balance" } }
      }
    }
  }
}
Response:
"aggregations": {
  "age_terms": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 511,
    "buckets": [
      {
        "key": 27,
        "doc_count": 39,
        "min_balance": { "value": 1110 },
        "max_balance": { "value": 46868 }
      },
      {
        "key": 39,
        "doc_count": 60,
        "min_balance": { "value": 3589 },
        "max_balance": { "value": 47257 }
      },
      ...
    ]
  }
}
Buckets can be ordered by a metric computed in a sub-aggregation, for example by maximum balance:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": { "max_balance": "asc" }
      },
      "aggs": {
        "max_balance": { "max": { "field": "balance" } }
      }
    }
  }
}
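Conceptually, a terms aggregation with metric sub-aggregations is a group-by followed by per-group metrics and a sort. The sketch below does this in plain Python on a handful of made-up accounts; it ignores sharding and the size limit.

```python
from collections import defaultdict

# Made-up sample accounts.
accounts = [
    {"age": 39, "balance": 3589},
    {"age": 27, "balance": 1110},
    {"age": 39, "balance": 47257},
    {"age": 27, "balance": 46868},
    {"age": 39, "balance": 20000},
]

# Bucket step: collect balances per distinct age.
balances_by_age = defaultdict(list)
for account in accounts:
    balances_by_age[account["age"]].append(account["balance"])

# Metric step: one result row per bucket.
buckets = [
    {
        "key": age,
        "doc_count": len(balances),
        "min_balance": min(balances),
        "max_balance": max(balances),
    }
    for age, balances in balances_by_age.items()
]

# "order": { "max_balance": "asc" }
buckets.sort(key=lambda b: b["max_balance"])

for bucket in buckets:
    print(bucket)
```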
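A stats sub-aggregation computes the count, maximum, minimum, average, and sum of the balance, and the buckets can be ordered by any one of them:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": { "stats_balance.max": "asc" }
      },
      "aggs": {
        "stats_balance": { "stats": { "field": "balance" } }
      }
    }
  }
}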
Buckets can be filtered by a minimum document count or restricted to a list of keys.
Only buckets with at least 60 documents:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "min_doc_count": 60
      }
    }
  }
}
Only the buckets for ages 20 and 24:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "include": [20, 24]
      }
    }
  }
}
Terms can also be included or excluded by exact value or by regular expression. In the first request, JapaneseCars keeps only the listed make values and ActiveCarManufacturers drops the listed ones:
GET /_search
{
  "aggs": {
    "JapaneseCars": {
      "terms": {
        "field": "make",
        "include": ["mazda", "honda"]
      }
    },
    "ActiveCarManufacturers": {
      "terms": {
        "field": "make",
        "exclude": ["rover", "jensen"]
      }
    }
  }
}
With regular expressions:
GET /_search
{
  "aggs": {
    "tags": {
      "terms": {
        "field": "tags",
        "include": ".*sport.*",
        "exclude": "water_.*"
      }
    }
  }
}
Handling missing values: some documents may have no tags field or no value for it. missing specifies the bucket key under which such documents are counted:
GET /_search
{
  "aggs": {
    "tags": {
      "terms": {
        "field": "tags",
        "missing": "N/A"
      }
    }
  }
}
The filter aggregation aggregates only those documents, among the ones matched by the query, that satisfy a filter:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "filter": { "match": { "gender": "F" } },
      "aggs": {
        "avg_age": { "avg": { "field": "age" } }
      }
    }
  }
}
Index some sample data:
PUT /logs/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }
Then count the documents in several named filter buckets with the filters aggregation:
GET logs/_search
{
  "size": 0,
  "aggs": {
    "messages": {
      "filters": {
        "filters": {
          "errors": { "match": { "body": "error" } },
          "warnings": { "match": { "body": "warning" } }
        }
      }
    }
  }
}
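For the three log messages indexed above, the bucket counts can be worked out by hand. The sketch below approximates the match query with simple lowercase word matching, which is an assumption made for illustration and not the real analysis chain.

```python
import re

logs = [
    "warning: page could not be rendered",
    "authentication error",
    "warning: connection timed out",
]


def matches(body, term):
    """Crude stand-in for a match query: split into lowercase words and compare."""
    return term in re.findall(r"\w+", body.lower())


# One named bucket per filter, as in the "filters" aggregation.
named_filters = {"errors": "error", "warnings": "warning"}
buckets = {
    name: sum(matches(body, term) for body in logs)
    for name, term in named_filters.items()
}

print(buckets)  # {'errors': 1, 'warnings': 2}
```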
The range aggregation buckets documents by value ranges. Each range includes its from value and excludes its to value:
POST /bank/_search?size=0
{
  "aggs": {
    "age_range": {
      "range": {
        "field": "age",
        "ranges": [
          { "to": 25 },
          { "from": 25, "to": 35 },
          { "from": 35 }
        ]
      },
      "aggs": {
        "bmax": { "max": { "field": "balance" } }
      }
    }
  }
}
The response has three buckets, one each for the to, from/to, and from ranges:
"aggregations": {
  "age_range": {
    "buckets": [
      {
        "key": "*-25.0",
        "to": 25,
        "doc_count": 225,
        "bmax": { "value": 49587 }
      },
      {
        "key": "25.0-35.0",
        "from": 25,
        "to": 35,
        "doc_count": 485,
        "bmax": { "value": 49795 }
      },
      {
        "key": "35.0-*",
        "from": 35,
        "doc_count": 290,
        "bmax": { "value": 49989 }
      }
    ]
  }
}
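The boundary handling is the part that is easy to get wrong: a value equal to a boundary belongs to the range that starts there, not to the one that ends there. The sketch below applies that rule (from inclusive, to exclusive) to a made-up list of ages; the function name is hypothetical.

```python
def range_buckets(values, ranges):
    """Count values per range, treating `from` as inclusive and `to` as exclusive."""
    buckets = []
    for r in ranges:
        low, high = r.get("from"), r.get("to")
        count = sum(
            1
            for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
        low_label = "*" if low is None else float(low)
        high_label = "*" if high is None else float(high)
        buckets.append({"key": f"{low_label}-{high_label}", "doc_count": count})
    return buckets


ages = [20, 24, 25, 30, 34, 35, 40]
ranges = [{"to": 25}, {"from": 25, "to": 35}, {"from": 35}]

for bucket in range_buckets(ages, ranges):
    print(bucket)
```

Here 25 falls into the middle bucket and 35 into the last one, so every age lands in exactly one bucket.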
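The date_range aggregation does the same for dates, and its bounds can use date math expressions:
POST /sales/_search?size=0
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ]
      }
    }
  }
}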
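The date math expression now-10M/M reads as: take the current time, subtract 10 months, then round down to the start of that month. The sketch below evaluates just this one pattern in Python for a fixed, made-up "now"; the function name is hypothetical and time zones are ignored.

```python
from datetime import datetime


def months_ago_start_of_month(now, months):
    """Evaluate `now-<months>M/M`: subtract months, then round down to the month start."""
    # Count months from year 0 so the subtraction can cross year boundaries.
    total_months = now.year * 12 + (now.month - 1) - months
    year, month_index = divmod(total_months, 12)
    return datetime(year, month_index + 1, 1)


now = datetime(2024, 3, 15, 10, 30)
boundary = months_ago_start_of_month(now, 10)
print(boundary)  # 2023-05-01 00:00:00
```

With this boundary, the first range in the request covers everything before 2023-05-01 and the second covers everything from that instant on.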
The date_histogram aggregation buckets documents by time interval, such as day, month, or year. The interval can be a calendar unit, namely year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m), or second (1s), or an explicit length of time.
By month:
POST /sales/_search?size=0
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
By a fixed interval of 90 minutes:
POST /sales/_search?size=0
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "90m"
      }
    }
  }
}
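Both kinds of interval boil down to rounding each timestamp down to a bucket key and counting. In the sketch below the sample timestamps are made up, and aligning fixed intervals to the Unix epoch in UTC is an assumption of this illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def month_key(ts):
    """Calendar interval: round down to the first instant of the month."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def fixed_key(ts, interval):
    """Fixed interval: round down to a whole multiple of `interval` since the epoch."""
    return EPOCH + ((ts - EPOCH) // interval) * interval


sales = [
    datetime(2015, 1, 1, 1, 0, tzinfo=timezone.utc),
    datetime(2015, 1, 1, 1, 45, tzinfo=timezone.utc),
    datetime(2015, 2, 1, 9, 0, tzinfo=timezone.utc),
]

by_month = Counter(month_key(ts) for ts in sales)
by_90_minutes = Counter(fixed_key(ts, timedelta(minutes=90)) for ts in sales)

print(by_month)
print(by_90_minutes)
```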
The missing aggregation collects the documents that have no value for a field into a single bucket:
POST /bank/_search?size=0
{
  "aggs": {
    "account_without_a_age": {
      "missing": { "field": "age" }
    }
  }
}
Reposted from: http://kwpxi.baihongyu.com/