Database design suggestions for a data scraping/warehouse application? -- a dba question tagged mysql, database-design, optimization, database-recommendation, data-warehouse

Database design suggestions for a data scraping/warehouse application?


1 vote

Question

I'm looking into the database design for a data warehouse kind of project which involves a large number of inserts daily. The data archives will be further used to generate reports. I will have a list of users (for example a user set of 2 million), for which I need to monitor daily social networking activities associated with them.

For example, let there be a set of 100 users say U1, U2, ..., U100.

I need to insert their daily status count into my database.

Consider the total status counts obtained for user U1 for the period June 30 - July 6, as follows:

  June 30 - 99
  July 1  - 100
  July 2  - 102
  July 3  - 102
  July 4  - 105
  July 5  - 105
  July 6  - 107

The database should keep daily status count of each user, like for user U1:

  July 1 - 1 (100-99)
  July 2 - 2 (102-100)
  July 3 - 0 (102-102)
  July 4 - 3 (105-102)
  July 5 - 0 (105-105)
  July 6 - 2 (107-105)

Similarly the database should hold daily details of the full set of users.

And in a later phase I envision taking aggregate reports out of this data, such as total points scored on each day, week, month, etc., and comparing them with older data.

I need to start things from scratch. I am experienced with PHP as a server-side scripting language, and with MySQL. I am confused on the database side. Since I need to process about a million insertions daily, what are all the things that should be taken care of?

I am confused about how to design a MySQL database in this regard. Which storage engine should I use, and which design patterns should be followed, keeping in mind that the data could later be used effectively with aggregate functions?

Currently I envision the DB design with one table storing all the user IDs, and a separate status count table for each day referencing it with a foreign key.

Does MySQL fit my requirement? 2 million or more DB operations are done every day. What should be considered regarding the server and other aspects in this case?

EDIT:

Queries Involved:

INSERTION QUERIES

The system should be capable of handling 1-2 million inserts every day. (We don't have updates here.)

RETRIEVAL QUERIES

1. Sum of statuses for the whole set of users.

2. Sum of statuses for a set of users under a geographic location.

3. Comparing status counts across days/weeks/months.
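For concreteness, here is a sketch of what these three retrieval patterns could look like in SQL. The table and column names (daily_counts, status_count, users, location) are hypothetical, assuming one row per user per day:

```sql
-- 1. Sum of statuses for the whole set of users on one day:
SELECT SUM(status_count)
FROM daily_counts
WHERE day = '2013-07-06';

-- 2. Sum of statuses for users under a geographic location:
SELECT SUM(dc.status_count)
FROM daily_counts dc
JOIN users u ON u.user_id = dc.user_id
WHERE u.location = 'IN'
  AND dc.day = '2013-07-06';

-- 3. Comparing status counts across weeks:
SELECT YEARWEEK(day) AS wk, SUM(status_count) AS total
FROM daily_counts
GROUP BY YEARWEEK(day)
ORDER BY wk;
```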

--> I believe some kind of indexes are needed in this case, but I have read that indexes could slow down insertion.

--> Also, I have heard MyISAM would be a better choice than InnoDB considering speed aspects.

Please advise.

              
   
   

Answers

2 votes


These are general recommendations, as you do not show the full extent of the queries to be performed (what kind of analytics you plan to do).

Assuming you do not need real-time results, you should denormalize your data at the end of each period, precalculate your aggregated results once for all necessary timeframes (by day, by week, by month), and work only with summary tables. Depending on the queries you intend to run, you may not even need the original data.
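As a minimal sketch of that approach (the table and column names here, daily_counts and summary_daily, are hypothetical, assuming one raw row per user per day):

```sql
-- Summary table holding one precalculated total per day.
CREATE TABLE summary_daily (
    day   DATE NOT NULL PRIMARY KEY,
    total BIGINT UNSIGNED NOT NULL
);

-- End-of-day rollup; reports then read summary_daily only.
INSERT INTO summary_daily (day, total)
SELECT day, SUM(status_count)
FROM daily_counts
WHERE day = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY day;
```

Weekly and monthly totals can in turn be derived from summary_daily rather than from the raw rows.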

If durability is not a problem (you can always recalculate statistics, as the raw data is elsewhere), you can use a caching mechanism (external, or the memcached interface included with MySQL 5.6), which works great for writing and reading key-value data in memory.

Use partitioning (it can also be done manually): with these kinds of applications, the most frequently accessed rows are usually also the most recent. Delete or archive old rows to other tables to use your memory efficiently.
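A sketch of native range partitioning by month (hypothetical table; note that MySQL requires every unique key to include the partitioning column, which the (day, user_id) primary key satisfies):

```sql
CREATE TABLE daily_counts (
    day          DATE NOT NULL,
    user_id      INT UNSIGNED NOT NULL,
    status_count SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (day, user_id)
)
PARTITION BY RANGE (TO_DAYS(day)) (
    PARTITION p2013_06 VALUES LESS THAN (TO_DAYS('2013-07-01')),
    PARTITION p2013_07 VALUES LESS THAN (TO_DAYS('2013-08-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Archiving or deleting a whole month is then nearly free:
ALTER TABLE daily_counts DROP PARTITION p2013_06;
```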

Use InnoDB if you want durability and highly concurrent writes, and if your most frequently accessed data is going to fit into memory. There is also TokuDB: it may not be faster in raw terms, but it scales better when dealing with insertions into huge, tall tables and allows for on-disk compression. There are also analytics-focused engines like Infobright.

Edit:

23 insertions/second (2 million inserts/day is roughly 23/s) is feasible on any storage, even with a bad disk, but:

  • You do not want to use MyISAM: it cannot do concurrent writes (except under very specific conditions), and you do not want huge tables that can become corrupted and lose data.

  • InnoDB is fully durable by default; for better performance you may want to reduce the durability guarantees or have a good backend (disk caches). InnoDB tends to get slower on insertion with huge tables. The definition of huge is "the upper parts of the primary key/other unique indexes must fit into the buffer pool" in order to check for uniqueness. That can vary depending on the memory available. If you want scalability beyond that, you have to partition (as I suggested above)/shard, or use one of the alternative engines I mentioned before (TokuDB).

SUM() statistics do not scale on normal MySQL engines. An index increases performance, again, because most of the operations can be done in memory, but one entry per row still has to be read, in a single thread. I mentioned design alternatives (summary tables, caching) and alternative engines (column-based) as a solution to that. But if you do not need real-time results, only report-like queries, you shouldn't worry too much about that.
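In other words, once the summary tables exist, the report-like comparisons become cheap scans over small tables, e.g. (assuming a hypothetical summary_daily table with one precalculated total per day):

```sql
-- Week-over-week comparison from a summary table instead of
-- SUM() over millions of raw rows:
SELECT YEARWEEK(day) AS wk, SUM(total) AS weekly_total
FROM summary_daily
GROUP BY YEARWEEK(day)
ORDER BY wk;
```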

I suggest you do a quick load test with fake data. I've had many clients doing social-network analytics on MySQL without problems (well, at least after I helped them :-) ), but your decision may depend on your actual non-functional requirements.

 
 
 
 
0 votes


In addition to what Jynus said: be sure your table is physically clustered on Date first. This will make range scans very efficient, so aggregation up to weeks or months will be fast. Even if you choose to materialize these week- or month-level totals in summary tables, clustering by date will help by making those updates very quick.

This kind of situation - many range scans - is an excellent example of where you'd lead with a low-cardinality field (the date) rather than a high-cardinality one (the user ID). You will still want an index on UserID, however.

  CREATE TABLE Activity (
      Date        DATE NOT NULL,
      UserID      INT NOT NULL REFERENCES Users(UserID),
      PRIMARY KEY (Date, UserID),
      NumUpdates  TINYINT UNSIGNED  -- assuming a user cannot update more than 255 times per day; alternately, consider SMALLINT
  )
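The secondary index mentioned above, plus a per-user range query it would serve, might look like this (the user ID 42 is just a placeholder):

```sql
-- Secondary index for per-user lookups; the clustered
-- (Date, UserID) primary key keeps date-range scans fast,
-- while this index covers user-centric queries.
CREATE INDEX IX_Activity_User ON Activity (UserID, Date);

SELECT SUM(NumUpdates)
FROM Activity
WHERE UserID = 42
  AND Date BETWEEN '2013-07-01' AND '2013-07-31';
```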
 
 


Licensed under cc by-sa 3.0 with attribution required.