Optimizing query in PostgreSQL that tries to match a string and matches a timestamp range

Tags: postgresql, optimization, postgresql-9.4 (dba)


1
vote

Question


I am building a database in PostgreSQL for financial data, where the table looks like this:

create table fin_data (
    brokerid text,
    stock int,
    holding bigint,
    stake float,
    value bigint,
    price float,
    holding_time tstzrange,
    unique (brokerid, stock, holding, holding_time)
);

This seems to already create a default index called: fin_data_brokerid_stock_holding_holding_time_key

How the data is organized: Each 'stock' has anywhere from 100-500 'brokers' holding a certain # of shares per day. I have around 95 million rows.

Example of my main query is:

select *
from fin_data
where brokerid = 'C00019'
  and holding_time @> '2015-09-28 00:00:00+08'::timestamp with time zone;

Which basically means: for a specified broker, I want to find out how many shares that broker holds on a given day.
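To make the range predicate concrete: `@>` tests containment, i.e. whether the queried instant lies inside the row's holding_time range. A minimal sketch of the operator, with illustrative values not taken from the post:

```sql
-- tstzrange defaults to inclusive-exclusive '[)' bounds, so an instant
-- inside the day is contained in that day's range:
select tstzrange('2015-09-28 00:00:00+08',
                 '2015-09-29 00:00:00+08') @> '2015-09-28 12:00:00+08'::timestamptz;
-- returns true
```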

Unfortunately, this is not a very fast query, as my explain analyze looks something like this:

Bitmap Heap Scan on fin_data
    (cost=115762.08..1362596.53 rows=14079 width=64)
    (actual time=571.648..1729.416 rows=1840 loops=1)
  Recheck Cond: (brokerid = 'C00019'::text)
  Rows Removed by Index Recheck: 116323
  Filter: (holding_time @> '2015-09-28 00:00:00+08'::timestamp with time zone)
  Rows Removed by Filter: 2054316
  ->  Bitmap Index Scan on fin_data_brokerid_stock_holding_holding_time_key
          (cost=0.00..115758.56 rows=1982116 width=0)
          (actual time=569.477..569.477 rows=2056156 loops=1)
        Index Cond: (brokerid = 'C00019'::text)
Total runtime: 1729.933 ms
(8 rows)

Runtime is 1.73 seconds.

This could easily go up to 2-3 seconds for certain brokers, and is somewhat acceptable but I want it to be under 1 second if possible.

The strange thing is, when I later made an index on (stock, holding_time) only, some of those queries would be in the sub-100ms range, usually 100-300 ms. The analysis might look like:

Index Scan using stock_date_index on fin_data
    (cost=0.57..790795.62 rows=3136 width=64)
    (actual time=3.825..411.985 rows=644 loops=1)
  Index Cond: (stock = 5)
  Filter: (holding_time @> '2015-09-28 00:00:00+08'::timestamp with time zone)
  Rows Removed by Filter: 447426
Total runtime: 412.123 ms
(5 rows)

Which is much more desirable.

A hard fact about the data is that each stock might only have up to 500 brokers, but some brokers could have up to 1000+ stocks. Is that the only reason my queries are so much slower? How do I optimize this?

        
       
       

Answers

1
 
vote


Default B-Tree indexes can't operate very intelligently on range data types. What you want is a GiST index on the range, so something like:

create index on fin_data using gist (holding_time); 

For your main query, this would probably give a BitmapAnd plan, which would use one index to get the list of row pointers satisfying brokerid and the other to get the list satisfying holding_time, then take the intersection of the lists before visiting the table. This could be quite a bit faster. This is the most flexible case, because that GiST index can be used in combination with any other indexes you have. If that is not fast enough, you could try create extension btree_gist and then make the index:

create index on fin_data using gist (brokerid, holding_time); 

This would be less flexible for other query variants, but ideal for the query you identified as your main query.
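For instance, a variant that filters only on the time range is still served directly by the single-column gist (holding_time) index, while the composite index is tailored to the brokerid-plus-range shape. A sketch, assuming the two indexes above exist:

```sql
-- Ideal fit for the composite (brokerid, holding_time) GiST index:
select stock, holding
from fin_data
where brokerid = 'C00019'
  and holding_time @> '2015-09-28 00:00:00+08'::timestamptz;

-- Filters on the range alone: the single-column gist (holding_time)
-- index matches this shape directly:
select brokerid, stock, holding
from fin_data
where holding_time @> '2015-09-28 00:00:00+08'::timestamptz;
```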

Note that GiST indexes take a lot longer to build initially than B-Tree indexes. They are also slower to maintain when new rows are inserted, so if your system struggles to keep up with the insertion rate, you would want to benchmark this first.


Another idea which doesn't involve making more indexes, is to select just the columns you need rather than '*' in your queries. That might allow index-only scans to be used where they otherwise cannot be.
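For example, the existing unique index on (brokerid, stock, holding, holding_time) already covers these columns, so a narrower select may qualify for an index-only scan, assuming the table is vacuumed often enough for the visibility map to be mostly set:

```sql
-- Every referenced column lives in the unique B-Tree index, so
-- PostgreSQL can potentially answer this without touching the heap:
select stock, holding, holding_time
from fin_data
where brokerid = 'C00019'
  and holding_time @> '2015-09-28 00:00:00+08'::timestamptz;
```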

 
 







Licensed under cc by-sa 3.0 with attribution required.