January 29, 2008

How MySQL Query Cache works with Transactions

As MySQL Manual Says Query Cache works with transactions with Innodb tables but it does not tell you how and with which restrictions.

According to my tests it works but it is very restricted and one could expect it to work much better:

The result set can be retrieved from query cache (for statements both inside and outside of transactions) until there is a statement inside transactions which modifies the table. As soon as table is modified in transaction it becomes uncachable by query cache until that transaction is committed. Not only query cache can’t be used inside the same transaction which modified data but also in other concurrent transactions which do not even see the changes done yet

Of course such implementation is rather restricted. Queries outside of transactions well could use query cache until it is invalidated by committed transaction, however it was probably too hard to implement using current query cache infrastructure. With current approach Innodb can probably do something as simple as marking table “uncachable” if it has any uncommitted changed which would take care about all complicated aspects of change visibility in different transaction modes.

In most cases this limitation should not cause many problems (compared to general coarse table base invalidation it does) - only in case of long uncommitted transactions you will get data being uncachable for concurrent workload for a long time.


Entry posted by peter |
No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(MySQL Query Cache and prepared statements - support comming ?) (Beware large Query_Cache sizes) (Getting real life query speeds with MySQL) (MySQL Query Cache WhiteSpace and comments) (Query Profiling with MySQL: Bypassing caches

No more MySQL Crash Safe Replication in 5.0 ?

As you might know even if you’re only using Innodb tables your replication is not completely crash safe - if Slave MySQL Server crashes/power goes down it is likely for relay logs to run out of sync (they are not synced to the disk) plus position on the master which slave remembers becomes stale.

During MySQL 4.0 and 4.1 series there was a great workaround if you’re using only Innodb tables - Innodb when Innodb does crash recovery it would print position in master log files up to which replication was done:

SQL:

  1. InnoDB: IN a MySQL replication slave the last master binlog file
  2. InnoDB: position 0 115, file name portland-bin.001717

All you needed to do is to use –skip-slave-start on the slave server and have a little script which will do CHANGE MASTER TO to specified location to restore replication in case of crash (assuming you’re only using Innodb tables of course)

Another way this functionality was usable is cloning Slave->Slave by use of LVM without pausing replication (so you can get consistent master position).

It is all great but it does not work any more in MySQL 5.0 Baron has spotted it by incident when we were verifying some of examples for High Performance MySQL book.

In the bug Heikki explains the code was probably removed in MySQL 5.0 during XA implementation though the code which prints the data back on recovery was not, so it prints you some log file name and position but they have nothing to do with real position on the master anymore.

I hope Innodb team will find a way to restore this functionality or at least remove confusing message which leaves impression this thing works.

Until this issue is fixed getting Crash Safe replication with MySQL is not impossible but surely more complicated and has much higher performance overhead - you can run slave with –sync-binlog and –log-slave-updates so you can see what last statement was executed before the crash and then find matching position in the master logs.

Interesting enough similar functionality is implemented in Mark Callaghan’s patches if you use rpl_transaction_enabled=1

Note even though this functionality is currently broken other somewhat similar functionality works as expected.
Innodb during recovery also prints position in the MySQL binary log:

SQL:

  1. InnoDB: Last MySQL binlog file position 0 589600615, file name ./galax-bin.001376

This one is helpful in other cases - for example when you’re taking LVM snapshot (for backup or to clone slave from the master) and can’t do traditional and recommended way with FLUSH TABLES WITH READ LOCK for snapshot creation. This happens when you have large amount of tables or your load pattern is to cause unacceptable stall if this lock is used. You can just take LVM Snapshot (assuming you’re only using Innodb tables and not touching your MyISAM system tables) and use this position to point to proper location on the master, or later use for point in time recovery using binary log.


Entry posted by peter |
No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(How To Set Up Database Replication In MySQL (16 Jan 2006)) (How To Set Up Database Replication In MySQL (16 Jan 2006)) (MySQL Replication and Slow Queries) (Beware: key_buffer_size larger than 4G does not work) (TechNet Webcast: A More Secure and Well-Managed Infrastructure (Part 11 of 18): Secure Data Access With SQL (Level 300)

January 24, 2008

Enum Fields VS Varchar VS Int + Joined table: What is Faster?

Really often in customers’ application we can see a huge tables with varchar/char fields, with small sets of possible values. These are “state”, “gender”, “status”, “weapon_type”, etc, etc. Frequently we suggest to change such fields to use ENUM column type, but is it really necessary (from performance standpoint)? In this post I’d like to present a small benchmark which shows MySQL performance when you use 3 different approaches: ENUM, VARCHAR and tinyint (+joined table) columns.

In practice you can also often use 4th variant which is not comparable directly, which is using integer value and having value mapping done on application level.

So, first of all, a few words about our data set we’ve used for this benchmark. We have 4 tables:
1) Table with ENUM:

SQL:

  1. CREATE TABLE cities_enum (
  2.   id int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  3.   state enum(‘Alabama’,‘Alaska’,‘Arizona’,‘Arkansas’,‘California’,‘Colorado’,‘Connecticut’,‘Delaware’,‘District of Columbia’,‘Florida’,‘Georgia’,‘Hawaii’,‘Idaho’,‘Illinois’,‘Indiana’,‘Iowa’,‘Kansas’,‘Kentucky’,‘Louisiana’,‘Maine’,‘Maryland’,‘Massachusetts’,‘Michigan’,‘Minnesota’,‘Mississippi’,‘Missouri’) NOT NULL,
  4.   city varchar(255) NOT NULL,
  5.   PRIMARY KEY  (id),
  6.   KEY state (state)
  7. ) ENGINE=MyISAM;

2) Table with VARCHAR:

SQL:

  1. CREATE TABLE cities_varchar (
  2.   id int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  3.   state varchar(50) NOT NULL,
  4.   city varchar(255) NOT NULL,
  5.   PRIMARY KEY  (id),
  6.   KEY state (state)
  7. ) ENGINE=MyISAM;

3) Table with INT:

SQL:

  1. CREATE TABLE cities_join (
  2.   id int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  3.   state_id tinyint(3) UNSIGNED NOT NULL,
  4.   city varchar(255) NOT NULL,
  5.   PRIMARY KEY  (id),
  6.   KEY state_id (state_id)
  7. ) ENGINE=MyISAM;

4) Dictionary table for cities_join:

SQL:

  1. CREATE TABLE IF NOT EXISTS `states` (
  2.   `id` tinyint(3) NOT NULL AUTO_INCREMENT,
  3.   `name` char(40) NOT NULL,
  4.   PRIMARY KEY  (`id`),
  5.   UNIQUE KEY `name` (`name`)
  6. ) ENGINE=MyISAM;

All cities_* tables have 1,5M records each and records are distributed among 29 different states (just happens to be data we had available for tests)

Two important notes about this table before we get to results - this is rather small table which fits in memory in all cases (and dictionary table does too). Second - the rows are relatively short in this table so changing state from VARCHAR to ENUM or TINYINT affects row size significantly. In many cases size difference will be significantly less.

All tests are runned 1000 times and the result time is average from those 1000 runs.

So, our first benchmark is simple: we need to get 5 names of cities, located in Minnesota and, to make things slower, we’ll take those records starting from record #10000 making MySQL to discard first 10000 records.

1) Results for ENUM:

SQL:

  1. SELECT SQL_NO_CACHE city FROM cities_enum WHERE state=‘Minnesota’ LIMIT 10000,5;
  2. Result time(mean): 0.082196

2) Results for VARCHAR:

SQL:

  1. SELECT SQL_NO_CACHE city FROM cities_varchar WHERE state=‘Minnesota’ LIMIT 10000,5;
  2. Result time(mean): 0.085637

3) Results for INT + join:

SQL:

  1. SELECT SQL_NO_CACHE c.city FROM cities_join c JOIN states s ON (s.id = c.state_id) WHERE s.name=‘Minnesota’ LIMIT 10000,5;
  2. Result time(mean): 0.083277

So, as you can see, all three approaches are close with ENUM being fastest and VARCHAR few percent slower.

This may look counterintuitive because table is significantly smaller with ENUM or TINYINT but in fact it is quite expected - This is MyISAM table which is accessed via index, which means to retrieve each row MySQL will have to perform OS system call to read the row, at this point there is not much difference if 20 or 30 bytes are being read. For Full Table Scan operation difference often would be larger.

It is also interesting to note performance of Innodb tables in this case: for VARCHAR it takes about 0.022 per query which makes it about 4 times faster than for MyISAM. This is great example of the case when Innodb is much faster than MyISAM for Read load.

The other surprise could be almost zero cost of the join, which we always claimed to be quite expensive. Indeed there is no cost of the join in this case because there is really no join:

SQL:

  1. mysql> EXPLAIN SELECT SQL_NO_CACHE c.city FROM cities_join c JOIN states s ON (s.id = c.state_id) WHERE s.name=‘Minnesota’ LIMIT 10000,5 \G
  2. *************************** 1. row ***************************
  3.            id: 1
  4.   select_type: SIMPLE
  5.         TABLE: s
  6.          type: const
  7. possible_keys: PRIMARY,name
  8.           KEY: name
  9.       key_len: 40
  10.           ref: const
  11.          rows: 1
  12.         Extra:
  13. *************************** 2. row ***************************
  14.            id: 1
  15.   select_type: SIMPLE
  16.         TABLE: c
  17.          type: ref
  18. possible_keys: state
  19.           KEY: state
  20.       key_len: 1
  21.           ref: const
  22.          rows: 225690
  23.         Extra:
  24. 2 rows IN SET (0.10 sec)

Because we refer state by name, which is unique,it is pre-read and query executed basically on single table querying state by ID.

Next test was a result of my curiosity. I’ve tried to order results by states.

1) Results for ENUM:

SQL:

  1. SELECT SQL_NO_CACHE city FROM cities_enum ORDER BY state LIMIT 10000, 5;
  2. Result time(mean): 0.077549

2) Results for VARCHAR:

SQL:

  1. SELECT SQL_NO_CACHE city FROM cities_varchar ORDER BY state LIMIT 10000, 5;
  2. Result time(mean): 0.0854793

3)

SQL:

  1. SELECT SQL_NO_CACHE c.city FROM cities_join c JOIN states s ON (s.id = c.state_id) ORDER BY s.name LIMIT 10000,5;
  2. Result time(mean): 26.0854793

As you can see, ENUM and VARCHAR show close performance, while join performance degraded dramatically.

Here is why:

SQL:

  1. mysql> EXPLAIN SELECT SQL_NO_CACHE c.city FROM cities_join c JOIN states s ON (s.id = c.state_id) ORDER BY s.name LIMIT 10000,5\G
  2. *************************** 1. row ***************************
  3.            id: 1
  4.   select_type: SIMPLE
  5.         TABLE: c
  6.          type: ALL
  7. possible_keys: state
  8.           KEY: NULL
  9.       key_len: NULL
  10.           ref: NULL
  11.          rows: 1439943
  12.         Extra: USING TEMPORARY; USING filesort
  13. *************************** 2. row ***************************
  14.            id: 1
  15.   select_type: SIMPLE
  16.         TABLE: s
  17.          type: eq_ref
  18. possible_keys: PRIMARY
  19.           KEY: PRIMARY
  20.       key_len: 1
  21.           ref: test.c.state_id
  22.          rows: 1
  23.         Extra:
  24. 2 rows IN SET (0.00 sec)

Because we’re sorting by name we have to perform the join for each row to retrieve it. This also means sort can’t be done by index and extra sort pass (filesort) is required, which also makes MySQL to store Join result in temporary table to do the sort, all together makes things quite miserable. Note this might not be best execution plan to pick in this case but this is other story.

To avoid part of this problem we of course arrange state ids in the alphabetical order and do sort by state_id, though join cost still could be significant.

And the last test - selecting city and name in arbitrary order, skipping first 10000 rows to make query times longer.

1) Results for ENUM:

SQL:

  1. SELECT SQL_NO_CACHE city, state FROM cities_enum LIMIT 10000, 5;
  2. Result time(mean): 0.003125

2) Results for VARCHAR:

SQL:

  1. SELECT SQL_NO_CACHE city, state FROM cities_varchar LIMIT 10000, 5;
  2. Result time(mean): 0.003283

3)

SQL:

  1. SELECT SQL_NO_CACHE c.city, s.name FROM cities_join c JOIN states s ON (s.id = c.state_id) LIMIT 10000,5;
  2. Result time(mean): 0.004170

As you can see, ENUM and VARCHAR results are almost the same, but join query performance is 30% lower.
Also note the times themselves - traversing about same amount of rows full table scan performs about 25 times better than accessing rows via index (for the case when data fits in memory!)

So, if you have an application and you need to have some table field with a small set of possible values, I’d still suggest you to use ENUM, but now we can see that performance hit may not be as large as you expect. Though again a lot depends on your data and queries.


Entry posted by Alexey Kovyrin |
8 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(Redundant index is not always bad) (Using delayed JOIN to optimize count(*) and LIMIT queries) (Using CHAR keys for joins, how much is the overhead ?) (Trailing spaces in MySQL) (What exactly is read_rnd_buffer_size

December 20, 2007

What is the longest part of Innodb Recovery Process ?

In MySQL 4.1 and above the longest part of recovery after crash for Innodb tables could be UNDO stage - it was happening in foreground and was basically unbound - if you have large enough transaction which needed to be undone this could take long hours.
REDO stage on other hand always could be regulated by size of your Innodb log files so you could have it as large as you like. Read more about it here.

Since MySQL 5.0 the UNDO stage is running in background so it still can be the longest but would not keep server completely unusable (some limitations still apply though).

In the case I’ve been working on recently none of these parts was the longest one.
The server had about 65000 tables using innodb_file_per_table so “InnoDB: Reading tablespace information from the .ibd files…” stage was taking most of the time.

Happily Innodb only needs to scan .ibd files when it was not shut down correctly otherwise restarts would be even more painful.

Even more longest phase has to do with restarts more than crash recovery as it presents in normal restarts as well - “Opening Tables”. As Innodb has to recompute the stats first time it opens the table this can take significant amount of time. Plus worst of all there is serialization in the table cache and only one table can be opened at the time as of MySQL 5.0

It would be great if Innodb would finally optionally store stats, same as MyISAM so one could recompute them in background. Also MySQL should fix things so more than one table can be opened at the same time (though I have not tested if it is still the case with 5.1 which as table_cache code rewritten dramatically)


Entry posted by peter |
No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(Magic Innodb Recovery self healing) (How Innodb flushes data to the disk ?) (Innodb Second Start prevention bug ?) (Innodb Recovery Update - The tricks what failed.) (Innodb crash recovery update

November 26, 2007

Data Recovery Toolkit for InnoDB Version 0.1 Released

As Peter mentioned in one of previous posts, we’ve done huge work developing robust strategies of InnoDB data recovery to provide our customers effective data recovery services and one of major parts of these strategies is our toolkit for InnoDB data recovery. Today I’m proud to announce its first public release which was used to help some of our customers to recover 95-100% of their deleted data.

This release already has a pretty decent set of features:

  • Supports both REDUNDANT (pre mysql 5.0) and COMPACT (mysql 5.0+) versions of tablespaces
  • Works with single tablespaces and file-per-table tablespaces
  • Able to recover data even when processed InnoDB page has been reassigned to another table and/or was partially destroyed
  • Supports all MySQL data types except BLOBs, SETs and BITs (will be implemented in next releases)
  • Has really great set of data filters to define data ranges (for numbers), field lengths (for variable length fields), character sets (for strings), date periods (for dates), etc.
  • Shipped with easy to use tool which could be used to create innodb table definitions based on CREATE TABLE clauses, so you don’t need to write table definitions yourself - you just need to add data filters and get your data back (well, in most of the cases)
  • Results are presented in CSV file format which could be used with MySQL’s LOAD DATA function

So, if you intrigued enough and would like to check it out, welcome to Google Code page of the project where you can find latest version of the toolset code and more links to information resources related to InnoDB data structures and recovery procedures.


Entry posted by Alexey Kovyrin |
One comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(Magic Innodb Recovery self healing) (Innodb Second Start prevention bug ?) (Innodb Recovery Update - The tricks what failed.) (Innodb crash recovery update) (Disaster Recovery: Simplify File Recovery with Data Protection Manager

November 4, 2007

Heikki Tuuri answers to Innodb questions, Part II

I now got answers to the second portions of the questions you asked Heikki. If you have not seen the first part it can be found here.

Same as during last time I will provide my comments for some of the answers under PZ and will use HT for original Heikkis answer.

Q26: You also say on Unix/Linux only one read-ahead can happen at the same time. How many read-aheads can be waiting in Queue when or Innodb will schedule more read-aheads only when given read-ahead is completed?

HT: A query thread normally posts about 64 page read requests per a readahead. The InnoDB buffer is 256 page read requests for the aio read thread. Thus, about 4 readaheads can be in the queue at a time.

The read aio thread will normally read all the 64 page read requests in a readahead, and then move to serve the next readahead.

PZ: Would be interesting to play with read-ahead depth here. These settings mean there can be quite significant stall waiting for the page which is being requested via read-ahead.

Q27 If my workload has a hi amount of concurrent writes to about 6 tables, what version of Innodb/MySQL would you recommend? I’ve been staying at 4.1.14 because of all the bad things I’ve heard about loss of performance with 5.0.X. I’m hopeful that 5.1.X will be better, but still not sure.

HT: 5.0.xx, where xx >= 30 should be better than 4.1.xx.

PZ: I should note MySQL 5.0 vs 4.1 has a lot of performance differences and not all of them are Innodb specific and it well can affect your workload. So benchmark to be sure. In Innodb besides scalability changes in later 5.0 series compact row format was implemented which can help performance (keeping data smaller) or can a bit slow it down. Plus there were bunch of micro optimizations in Innodb in 5.0 series. Also note the “Auto Increment Issue” is finally solved in 5.1 which may be even better.

Q28: My impression (from an external user of Innodb point of view) is that performance bugs don’t receive very hi priority. Yet, a lot of us are using MySQL because it seems to offer the neccessary performance. Are we just kidding ourselves, and should we switch to Oracle?

HT: You are right, bugs that cause data corruption, crashes, or wrong query results get the highest priority. We are slowly working on the performance enhancement requests. It takes time for the patches to mature. We improved the scalability of InnoDB-5.0.30 in December 2006 with a large patch, and the ‘last’ serious bug in it was only recently found and fixed: http://bugs.mysql.com/bug.php?id=29560

Concerning a switch to Oracle’s main database, I believe different applications work best on different database brands. Oracle’s main database has a very competent team working on it.

Q29: Would doubling the amount of RAM from 8Gig to 16Gig help the write performance of Innodb on Solaris?

HT: Yes, if your ‘write working set’ fits in 16 GB. I mean, if the database buffer pool can buffer many writes to the same page, then fewer disk writes are needed. But if your write working set is, say, 100 GB, then increasing the buffer pool size does not help.

PZ: I should note for most application working set is a distribution rather than fixed number so you can say “fits” or “does not fit”. Depending on data access distribution gains will be different ranging from no gains to very significant ones. Also do not forget you may need to increase size of your Innodb log files to be able to use a lot of memory for write buffering efficiently.

Q30: Question about using solid state disks for storing ib_log* files; I know that innodb writes files in the log file group in a round robin fashion. I also know that there is some writing that always occurs to the first file in a log file group. My questions are: What is always being written to the first file in the log file group?

HT: InnoDB writes the checkpoint information to the first file.

Q31: How often or what is the pattern of these first log file writes?

HT: At every checkpoint. It may happen even more often than once per second if there are dirty pages in the buffer pool.

Q32: Would they always occur to the same memory location(s)?

HT: Yes.

Q33: Do they occur for every log entry?

HT: No.

Q34 I’m just trying to see if there will be some special interaction between these always occurring writes to the first file in the log file group and the SSD dynamic wear leveling algorithm. I’m also wondering if you would recommend using SSD for this purpose, and how much performance gain there might be.

HT: I guess the log writes themselves are much more a problem to the SDD wear leveling: if you commit 1000 transactions per second, then you write 1000 times per second!

A smart wear leveling should be able to cope with writes to the same file position. That kind of access is so common in files.

The wear leveling is used for flash memories. I looked up some information on the web about flash memories, and noticed that random writes to them are actually slower than to a mechanical disk. I do not know how it is for sequential writes, like log flushes.

A question is why to use a flash memory at all, when a battery-backed disk controller write-back cache does the same thing.

PZ: First Indeed writes are slower with Flash disks compared to Reads because Flash requires “erase” cycle to replace the data plus this erase can only happen in the blocks of significant size, say 64K.

But generally I think Log files is not something SSD can give the main benefit to - writes to log files are small and sequential which can be very well taken care of RAID Battery Backed Up cache and when sequentially written to the drives at the rate of 100MB/sec or more. This cache is DRAM based and so much faster than Flash.

SSD Drives would be big win for TableSpace files but these are much larger and so less used for the time being.

What would be really cool if someone would come up with “Persistent DRAM” memory device, which would offer some small amount, say 512M memory which is battery backed up just as RAID cache so it can handle system restarts. When putting log files on such device could give great logging performance because you no more have to go with SCSI interface both in hardware and software. I’m still surprised with databases being so much of commodity no one has something like it.

Q35: I believe this is a somewhat trivial question, but it’s hard to find a good answer to it while googling.“What is the good way to split memory between INNODB buffers and OS?” Of course this depends on all kinds of things, but the only rule of thumb I seem to find is “INNODB should get around 70-80%” of the available RAM. I would guess that the number varied more, depending on the availability of RAM? On a dedicated server for Mysql you will still need some memory for the OS and the various stuff that will be running (monitoring, backup, cron, syslog, etc). But the ratio of RAM used for these non-mysql processes drops rapidly when you increase the amount of system memory. We just deployed a new Linux/Mysql server for a client, upgrading their main database (from 16GB, 2 two-year old HT’ed XEONs to 32 GB, 4x quad-core AMD Opterons. Yes it helped:-). At the moment the server has around 21 GB of memory used for applications (of which Mysql is the main one with “innodb_buffer_pool_size=20480M”) while the bulk of the rest (9 GB) is used for the OS cache (yes I know, we are below 70-80% already) So, is the general recomendation also for boxes with larger amounts of RAM, to give INNODB ~70-80% of the total RAM, and leave the rest to the OS cache? Is there special options or tricks that should be implemented on larger servers?

HT: I think Linux often likes to use about 20 % of the available RAM as its file cache.

The ideal configuration in a dedicated server is: set innodb_flush_method=O_DIRECT in my.cnf, to prevent double buffering. Configure as much memory as possible to InnoDB’s buffer pool. For this, you need to find a way to tell Linux to keep a small OS file cache. I am not sure if Linux currently has any switch for this.

PZ: Generally you can get much lower than 20% for Linux File Cache with good results. I just wrote the post on this this matter the other day - choosing innodb_buffer_pool_size

Q36: There have been several MySQL bugs opened about multi-core scalability (concurrent queries, autoincrement, concurrent inserts, instrumentation, etc.). Rather than treat each defect as an individual case, is there any performance testing going on to quantify what are the limits of different operations for different levels of concurrency, specifically around mutexes and contention? How does this fit into the development process?

HT: Contention and thread thrashing depends very much on the hardware, and also on the OS version. In our tests, removing one bottleneck moves the contention to another place. Thus, no general rules can be established.

I have assigned Inaam as the InnoDB scalability engineer. We do not have much resources for scalability research. Peter Zaitsev, Vadim Tkachenko, Yasufumi Kinoshita, and Mark Callaghan from outside Innobase have helped us greatly to spot scalability problems and find fixes for them. The work continues. I believe improving scalability is a continuous process for all major database brands.

Q37: We have a fairly large Innodb table (150GB) that is showing poor performance for full table scans. using O_DIRECT, we are seeing individual 16KB paged I/O going out one at a time, that are esentially reads misses (10ms each). The reads are not excactly sequential, there is a small gap between each. it is difficult to figure our what the problem is. We can OPTIMIZE the table, but how long will this take? Is there a facility we can run to show any table / index fragmentation, without having to run the OPTIMIZE?

HT: Since OPTIMIZE rebuilds the whole table, it may last way too long for your 150 GB table!

If you have inserted in an ascending order of the PRIMARY KEY, then InnoDB should have placed the rows in contiguous blocks of 64 pages == 1 MB. Then a full table scan would be fast. Maybe the table has become fragmented? Or maybe you have not inserted in the ascending order of the primary key?

I suggest that you test reorganizing that table on a separate computer, maybe a replication slave. Can you make the reorganization run in a tolerable time?

PZ: Indeed using master-master setup and switching roles is a good way to get large data manageable. MySQL Master Master Manager can help you to automate it a bit. I also like to keep tables small enough, “sharding” to many tables before they become so large it would take days to do operations on them. Idea for Heikki - it should be possible to implement online reorganize which locks and reorganizes small portions of table online.

Q38: Seems that adaptive hash indexes are not used for primary key lookups or not showed up in “show innodb status”.

select * from table where id=5 show innodb status: Hash table size 10624987, used cells 1, node heap has 1 buffer(s) 0.00 hash searches/s, 15979.01 non-hash searches/s

HT: Strange, primary key equality searches should work through the adaptive hash index! What is your database like? Do you have a big enough buffer pool?

PZ: I guess it may be bug in the status or bug in implementation. I’ll check with Alexander or check myself to see what it could be.


Entry posted by peter | 2 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(Heikki Tuuri to answer your in depth Innodb questions) (Heikki Tuuri Innodb answers - Part I) (How simple answer are you looking for ?) (Innodb Recovery - Is large buffer pool always better ?) (Opening Tables scalability

November 3, 2007

Choosing innodb_buffer_pool_size

My last post about Innodb Performance Optimization got a lot of comments choosing proper innodb_buffer_pool_size and indeed I oversimplified things a bit too much, so let me write a bit better description.

Innodb Buffer Pool is by far the most important option for Innodb Performance and it must be set correctly. I’ve seen a lot of clients which came through extreme sufferings leaving it at default value (8M). So if you have dedicated MySQL Box and you’re only using Innodb tables you will want to give all memory you do not need for other needs for Innodb Buffer Pool.

This of course assumes your database is large so you need large buffer pool, if not - setting buffer pool a bit larger than your database size will be enough. You also should account for growth of course. You need buffer pool a bit (say 10%) larger than your data (total size of Innodb TableSpaces) because it does not only contain data pages - it also contain adaptive hash indexes, insert buffer, locks which also take some time. Though it is not as critical - for most workloads if you will have your Innodb Buffer Pool 10% less than your database size you would not loose much anyway.

You also may choose to set buffer pool as if your database size is already larger than amount of memory you have - so you do not forget to readjust it later. This is also valid approach as if it is Dedicated MySQL Server you may not have a good use for that memory anyway.

Another thing you should keep into account is Innodb allocates more memory in structures related to buffer pool than you specify for it - I just checked on our set of boxes which use 12GB buffer pool (with very small additional pool and log buffer) and total amount of memory allocated by Innodb is typically a bit over 13GB. Vadim has posted some numbers on it a while back.

After you have decided with database size you need to check if there are any restrictions on Innodb Buffer Size you can use. Typically you would see restriction applying only on 32bit systems but we see these can be still spotted in a wild, especially in Windows world. The restriction would normally apply to total amount of memory process can allocate so make sure to leave space for other MySQL needs while factoring this in.

The next step would be to decide How Much Memory do you need for other needs. This needs would be OS needs - your system processes, page tables, socket buffers etc all need memory. I would put this to 256M for small sized boxes to 5% of memory size on the big boxes, though it can be even less than that. Besides Operating System needs you also have MySQL needs - these include MySQL buffers - query cache, key_buffer, mysql threads, temporary tables, per thread sort buffer which can be allocated. There are also things like innodb additional memory pool (which can grow more than memory you allocated for it, especially in case you have large amount of tables).

I could tell you some numbers, for example sum up all your global buffers plus add 1MB for each connection you’re planning to have but in reality the number can vary significantly depending on the load. Idle connections for example will consume significantly less memory than connections doing work with huge temporary tables and otherwise running complex queries. It is usually much better to simply check it. Start MySQL With 10GB Innodb buffer pool for example and see how large RSS and VSZ get in “ps” output on Unix Systems. If it gets to 12GB when you need 2GB for other stuff, and you can increase it a bit to be on the safe size and scale appropriately.

The third important memory consumer would be OS cache. You want to bypass cache for your Innodb tables but there are other things you need OS cache for - MyISAM tables (mysql database, temporary etc) will need it, .frm file, binary logs, or relay logs, Innodb Transactional Logs also like to be cached otherwise OS will need to do reads to serve writes to these log files as IO to the log files is not aligned to the page boundary. Finally you likely have some system script/processes running on the system which also need some cache. The number can be a lot different depending on system workload but generally I’d see values from 200MB to 1GB good estimates for this number.

Eliminate Double Buffering - This is again very important for buffer pool size choice. You do not want OS to cache what Innodb is caching already. Innodb cache is more efficient compared to OS cache because there is no copying, due to adaptive hash indexes, ability to buffer writes and number of other factors so you just want to make your OS to give a way to Innodb. Note it is not enough to block OS from swapping - as I already mentioned the OS cache will be needed for other things and if you will not make Innodb to bypass OS buffering Innodb TableSpace IO will wipe out cache because it typically makes most of the IO on the system. On Windows you do not need to do anything. On Linux, FreeBSD, Solaris you need to set innodb_flush_method=O_DIRECT. On other Operating Systems you may be able to select it on OS level but make sure to do it. There is a small niche case when it hurts - when you do not have RAID with BBU and your workload is very write intensive but there are always exceptions.

Make your OS Happy The other challenge you may have is making your OS happy and avoiding swapping out MySQL Process or other important processes to make room for file cache. OS may find it unfair there is MySQL process which consumes 95% of memory and the cache is just couple of percent. Some people try to solve it with disabling swap file but it can hurt another way - OS may kill MySQL Process running out of memory (or thinking it is running out of memory) which may happen ie in case of unexpected connection spike. Plus not all kernels work quite well with swap disabled and there are other reasons against it. For some people having no swap works, though they usually play on the safe side, having enough “free” memory used as Cache and Buffers.

Depending on OS you may want to do different VM memory adjustments. You may want to make MySQL to use Large Pages for allocating Innodb Buffer Pool and few other buffers, which may have other performance benefits as well. Tuning your VM to be less eager to swap things by echo 0 > /proc/sys/vm/swappiness is another helpful change though it does not always save you from swapping. Some specific kernel versions may have other settings to play with. Finally you can try locking MySQL in memory by using –memlock - just be careful as in case you have memory usage spike you may have MySQL Server being killed by OS instead of temporary swapping few things out.

I guess these clarifications work better than 70-80% recommendation and of course you should not stick just to 50GB if you have 64G of memory - values of 56-60G would likely make more sense, and depending on bunch of other settings 12G or 14G may well be good choice for 16GB Box, though I would take care at values close to 14G as there is not much room left for other things.

P.S I only described Innodb Buffer Pool selection for dedicated Innodb system. If you have fair amount of MyISAM, Archive, PBXT, Falcon or other storage engines then you will get into complex balancing game besides considering all these factors.


Entry posted by peter | 2 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(Turning RFID Data into Information) (Keywords and search engine optimization) (Increase Your Reach and Your Market with the Microsoft® Stack) (Trailing spaces in MySQL) (Using index for ORDER BY vs restricting number of rows.

November 1, 2007

Innodb Performance Optimization Basics

Interviewing people for our Job Openings I like to ask them a basic question - if you have a server with 16GB of RAM which will be dedicated for MySQL with large Innodb database using typical Web workload what settings you would adjust and interestingly enough most people fail to come up with anything reasonable. So I decided to publish the answer I would like to hear extending it with basics of Hardware OS And Application optimization. I call this Innodb Performance Optimization Basics so these are general guidelines which work well for wide range of applications, though the optimal settings of course depend on the workload.

Hardware If you have large Innodb database size Memory is paramount. 16G-32G is the cost efficient value these days. From CPU standpoint 2*Dual Core CPUs seems to do very well, while with even just two Quad Core CPUs scalability issues can be observed on many workloads. Though this depends on the application a lot. The third is IO Subsystem - directly attached storage with plenty of spindles and RAID with battery backed up cache is a good bet. Typically you can get 6-8 hard drives in the standard case and often it is enough, while sometimes you may need more. Also note new 2.5″ SAS hard drives. They are tiny but often faster than bigger ones. RAID10 works well for data storage and for read-mostly cases when you still would like some redundancy RAID5 can work pretty well as well but beware of random writes to RAID5.

Operating System First - run 64bit operating system. We still see people running 32bit Linux or 64bit capable boxes with plenty of memory. Do not do this. If using Linux setup LVM for database directory to get more efficient backup. EXT3 file system works OK in most cases, though if you’re running in particular roadblocks with it try XFS. You can use noatime and nodiratime options if you’re using innodb_file_per_table and a lot of tables though benefit of these is minor. Also make sure you wrestle OS so it would not swap out MySQL out of memory.

MySQL Innodb Settings The most important ones are: innodb_buffer_pool_size 70-80% of memory is a safe bet. I set it to 12G on 16GB box. innodb_log_file_size - This depends on your recovery speed needs but 256M seems to be a good balance between reasonable recovery time and good performance innodb_log_buffer_size=4M 4M is good for most cases unless you’re piping large blobs to Innodb in this case increase it a bit. innodb_flush_logs_at_trx_commit=2 If you’re not concern about ACID and can loose transactions for last second or two in case of full OS crash than set this value. It can dramatic effect especially on a lot of short write transactions. innodb_thread_concurrency=8 Even with current Innodb Scalability Fixes having limited concurrency helps. The actual number may be higher or lower depending on your application and default which is 8 is decent start innodb_flush_method=O_DIRECT Avoid double buffering and reduce swap pressure, in most cases this setting improves performance. Though be careful if you do not have battery backed up RAID cache as when write IO may suffer. innodb_file_per_table - If you do not have too many tables use this option, so you will not have uncontrolled innodb main tablespace growth which you can’t reclaim. This option was added in MySQL 4.1 and now stable enough to use.

Also check if your application can run in READ-COMMITED isolation mode - if it does - set it to be default as transaction-isolation=READ-COMITTED. This option has some performance benefits, especially in locking in 5.0 and even more to come with MySQL 5.1 and row level replication.

There are bunch of other options you may want to tune but lets focus only on Innodb ones today. You can check about tuning other options here or read one of our MySQL Presentations.

Application tuning for Innodb Especially when coming from MyISAM background there would be some changes you would like to do with your application. First make sure you’re using transactions when doing updates, both for sake of consistency and to get better performance. Next if your application has any writes be prepared to handle deadlocks which may happen. Third you would like to review your table structure and see how you can get advantage of Innodb properties - clustering by primary key, having primary key in all indexes (so keep primary key short), fast lookups by primary keys (try to use it in joins), large unpacked indexes (try to be easy on indexes).

With these basic innodb performance tunings you will be better of when majority of Innodb users which take MySQL with defaults run it on hardware without battery backed up cache with no OS changes and have no changes done to application which was written keeping MyISAM tables in mind.


Entry posted by peter | 9 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(IBM DB2 Performance Optimization eKit) (MySQL Master Class after HighLoad.RU) (How Innodb flushes data to the disk ?) (UC2007 Presentation and Notes) (Speaking at MySQL Users Conference 2007

October 31, 2007

The new cool MySQL patch has landed! Check your queries performance!

Microslow patch is used by many DBAs and developers to accurately time their queries and to catch those which run less than a second as they can also be a performance killer for a busy application.

Recently I have started the development of an updated version of the patch. The basic idea is the same as for its predecessor - to get more information about query execution logged into slow log, however the new version is loaded with a set of cool new features.


CONNECTION IDENTIFIER

Each slow log entry now contains a connection identifier, so you can trace all the queries coming from a single connection.

CODE:
  1. # Thread_id: 4

MICROTIME RESOLUTION QUERY TIMING

This is the original functionality offered by Microslow patch. The new edition is free of a tiny bug which was to treat long_query_time value as seconds. In effect one could only choose to log all queries to see those which take less than 1s. Now long_query_time is in microseconds and you can set it for example to 300000 which is 0.3s!

CODE:
  1. # Query_time: 0.503016  Lock_time: 0.000048  Rows_sent: 56  Rows_examined: 1113

LOGGING OF THE REPLICATED STATEMENTS

Normally MySQL will not write into slow log any queries executed by the slave’s SQL thread. This patch allows you to change that behavior with the new parameter --log-slow-slave-statements.

THE DETAILED INFORMATION ABOUT QUERY EXECUTION

Each query can be executed in various ways. For example it may use indexes or do a full table scan, or a temporary table may be needed. These are the things that you can usually see by running EXPLAIN on the query. The patch will now allow you to see the most important facts about the execution in the log file.

CODE:
  1. # QC_Hit: No  Full_scan: No  Full_join: No  Tmp_table: Yes  Disk_tmp_table: No
  2. # Filesort: Yes  Disk_filesort: No  Merge_passes: 0

QC_Hit tells whether query cache hit occurred or not. If it says ‘Yes’, then all other values will be ‘No’ or 0, because the query was not actually executed by the engine.

If Full_scan is set then most likely the query is a bad one, because it read all the rows from a table.

Full_join means any of the joins didn’t use indexes.

If a temporary table is required for query execution Tmp_table will be set. Sometimes that table must be created on disk instead of in memory, in such case Disk_tmp_table will state that.

Filesort is for indicating that filesort algorithm was used, while Disk_filesort means that sorting was done through temporary files.

INNODB USAGE

The final part are the InnoDB usage statistics. MySQL currently allows you to see many per-session statistics for operations with SHOW SESSION STATUS, but that does not include those of InnoDB, which are always global and shared by all the threads. The patch gives you the opportunity to see those values for a given query.

CODE:
  1. #   InnoDB_IO_r_ops: 19  InnoDB_IO_r_bytes: 311296  InnoDB_IO_r_wait: 0.382176
  2. #   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.067538
  3. #   InnoDB_pages_distinct: 20

InnoDB_IO_r_ops is the counter for the number of page read operations scheduled. The acutal number of read operations may be different, but since this can be done asynchronously, unfortunatelly there is no good way to measure that. InnoDB_IO_r_bytes is the corresponding value in bytes.

With InnoDB_IO_r_wait you can see how much time (in seconds) it took InnoDB to actually read the data from storage.

InnoDB_rec_lock_wait is the time (in seconds) query had to wait on row locks.

InnoDB_queue_wait is the time (in seconds) query spent either waiting to enter the InnoDB queue or inside that queue waiting for execution.

InnoDB_pages_distinct tells the number of unique pages accessed by the query. This is actually an approximation based on a small hash array representing the entire buffer pool, because it could take a lot of memory to map all the pages. The inaccuracy grows with the number of pages accessed by a query as there is more chance for hash collisions.

In case InnoDB is not used for the query execution that information is written into the log instead the above statistics.


The full slow log entry for a query using InnoDB can look like this:

CODE:
  1. # User@Host: macko[macko] @ localhost []
  2. # Thread_id: 4
  3. # Query_time: 0.503016  Lock_time: 0.000048  Rows_sent: 56  Rows_examined: 1113
  4. # QC_Hit: No  Full_scan: No  Full_join: No  Tmp_table: Yes  Disk_tmp_table: No
  5. # Filesort: Yes  Disk_filesort: No  Merge_passes: 0
  6. #   InnoDB_IO_r_ops: 19  InnoDB_IO_r_bytes: 311296  InnoDB_IO_r_wait: 0.382176
  7. #   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.067538
  8. #   InnoDB_pages_distinct: 20
  9. SET timestamp=1193841780;
  10. SELECT DISTINCT c from sbtest where id between 501895 and 502895 order by c;

or if InnoDB was not used:

CODE:
  1. # Time: 071031 20:03:16 # User@Host: macko[macko] @ localhost []
  2. # Thread_id: 12
  3. # Query_time: 0.026385  Lock_time: 0.000158  Rows_sent: 1  Rows_examined: 0
  4. # QC_Hit: No  Full_scan: No  Full_join: No  Tmp_table: No  Disk_tmp_table: No
  5. # Filesort: No  Disk_filesort: No  Merge_passes: 0
  6. # No InnoDB statistics available for this query
  7. SELECT COUNT(1) FROM t2;

Important note: Although the patch was tested and run successfully under stress with both synthetic benchmarks and real database traffic, it”s considered experimental and as such I don”t recommend using it in production or on any important system!

The patch is available for MySQL Community 5.0.45.

Maciek Dobrzanski


Entry posted by Maciej Dobrzanski | One comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(Microslow patch for 5.1.20) (Enabling/Disabling query log without server restart in MySQL 5.0) (Getting real life query speeds with MySQL) (Surprise, surprise!) (Yahoo Search Suggestions for MySQL

October 29, 2007

Hacking to make ALTER TABLE online for certain changes

Suppose you want to remove auto_increment from 100G table. No matter if it”s InnoDB or MyISAM, you”d usually ALTER TABLE `huge_table` CHANGE `id` `id` int(6) NOT NULL and then wait hours for table rebuild to complete. If you”re unlucky i.e. you have a lot of indexes and not too much RAM - you could end up waiting days. If you want to make this happen quick - there”s another way. Not documented, but works well with both - InnoDB and MyISAM.

Now that more and more folks hit the InnoDB auto-inc scalability issue with MySQL 5.0 and older versions, employing other techniques to maintain the PK auto incremental becomes more of an issue. One of the steps here is to remove current PK auto_incremental from the table. As a rule of thumb, this usually involves altering huge InnoDB tables and huge tables take time to rebuild.

Disclaimer: try this at your own risk. It worked for me, it may work for you too, but always have a backup before doing that kind of stuff, as it is not the way MySQL would advice you to do it and we can”t guarantee it will work well for you either.

So, in a nutshell, all you have to do is create another table with desired table structure and switch .frm table definition files. For safety, I”d recommend to flush tables with read lock while switching .frm files. When and how it works:

auto_increment (removing). Let”s have a simple table with auto_increment we want to get rid of:

SQL:
  1. CREATE TABLE `huge_table` (
  2.   `id` int(6) NOT NULL AUTO_INCREMENT,
  3.   `text` text NOT NULL,
  4.   PRIMARY KEY  (`id`)
  5. ) ENGINE=InnoDB

To remove auto_increment, we (1) create table with the same layout but without auto_increment, (2) flush tables with read lock, (3) swap .frm files while keeping mysql suspended and (4) unlock the tables afterwards:

SQL:
  1. – 1.
  2. mysql> CREATE TABLE `huge_table_new` (
  3. ->   `id` int(6) NOT NULL,
  4. ->   `text` text NOT NULL,
  5. ->   PRIMARY KEY  (`id`)
  6. -> ) ENGINE=INNODB;
  7. Query OK, 0 rows affected (0.01 sec)
  8.  
  9. – 2.
  10. mysql> FLUSH TABLES WITH READ LOCK;
  11. Query OK, 0 rows affected (0.00 sec)
  12.  
  13. – 3.
  14. mysql> ^Z
  15. [1]+  Stopped                 mysql test
  16. # mv huge_table.frm huge_table_old.frm && mv huge_table_new.frm huge_table.frm && mv huge_table_old.frm huge_table_new.frm
  17. # fg
  18. mysql test
  19. – 4.
  20. mysql> UNLOCK TABLES;
  21. Query OK, 0 rows affected (0.00 sec)
  22.  
  23. mysql> SHOW CREATE TABLE `huge_table`G
  24. *************************** 1. row ***************************
  25. TABLE: huge_table
  26. CREATE TABLE: CREATE TABLE `huge_table` (
  27.   `id` int(6) NOT NULL,
  28.   `text` text NOT NULL,
  29.   PRIMARY KEY  (`id`)
  30. ) ENGINE=InnoDB DEFAULT CHARSET=latin1
  31. 1 row IN SET (0.00 sec)
  32.  
  33. mysql> INSERT INTO `huge_table` (text) VALUES (“test”);
  34. ERROR 1062 (23000): Duplicate entry “0″ FOR KEY 1

Unfortunately, adding auto_increment does not work that way.

Enum values (add and remove). Enumerated values are added and removed the same way that auto_increment is removed. I”ve been a bit surprised, that removing value from enum() works as good as adding it - rows that have incorrect values are just returned as empty. But I suppose this does violate mysql data file structure, so be really careful with that one.

Default values. MySQL rebuilds table even if we only want to change the default value for new records so this may save one from a lot of trouble.

Table comment. I”m pretty sure that would work for changing table comment as well, however - changing a comment with a help of ALTER TABLE does not rebuild the table, so we better use the documented method for that.

What I”m surprised about is that changing a comment does not to require table to be rebuilt, while things like removing auto_increment or changing a default value still do even though this information is stored in table definition file.

If you”ll ever try this, please leave a comment if it did work for you. Maybe you have discovered some new things to alter that way?


Entry posted by Aurimas | 4 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Related Posts

(Metalink Hacking V1.01) (Working with large data sets in MySQL) (MSDN Webcast: A Hacker’s View of Your Web Applications Part 2: Web Hacking - Attack Scenarios and Examples - Level 300) (Online Gambling and Kids: A Bad Bet) (Magic Innodb Recovery self healing
« Previous entries