
Love your Exadata! Use Exadata Database DBRM


You have bought Exadata after spending a fortune and, more importantly, after convincing everybody in the organization and in your family that this is the ultimate resource machine, hosting the world’s best database, Oracle 12c. You are not off the mark with those statements, though one might term them a bit exaggerated. The point is […]

The post Love your Exadata! Use Exadata Database DBRM appeared first on VitalSoftTech.


Exadata - a new learning curve

Perhaps I am a touch late in adopting, exploring and coming to grips with Exadata, but it has always been my dream and passion to work with the technology since its announcement. As we know, Exadata is not something we can download and configure easily on a PC like the traditional database or RAC software. Luckily, and with immense help from one of my friends, I managed to simulate a (faked) Exadata setup on my new MacBook Pro: 2 cell servers and 2 Oracle 12c RAC database nodes. Believe me, they are running incredibly fast and I am already on my way to exploring and testing the capabilities of Exadata.

I have also set some goals so that I will not become lazy or slow down on what I am doing: I am planning to sit the Exadata Admin and Implementation Specialist certifications this December and January. If you are in the same boat as I am, feel free to contact me to discuss Exadata.

Have a good day

Exadata Write-back cache and free buffer waits


Prior to storage server software version 11.2.3.2.0 (associated with Exadata X3), the Exadata Smart Flash Cache was a “write-through” cache, meaning that write operations were applied both to the cache and to the underlying disk devices, but were not signalled as complete until the IO to the disk had completed.

Starting with 11.2.3.2.0 of the Exadata storage software[1], Exadata Smart Flash Cache may act as a write-back cache. This means that a write operation is made to the cache initially and de-staged to grid disks at a later time. This can be effective in improving the performance of an Exadata system that is subject to IO write bottlenecks on the Oracle datafiles.
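For reference, the cache mode is a cell-level setting changed through CellCLI. The following is only a sketch of the non-rolling variant of the procedure as I understand it, run on each cell with the databases down; verify the exact steps for your storage software version against the documentation before attempting it:

# sketch only: enable the write-back flash cache on one cell (non-rolling approach)
cellcli -e list cell attributes flashCacheMode        # shows WriteThrough or WriteBack
cellcli -e drop flashcache                            # flash cache must be re-created afterwards
cellcli -e alter cell shutdown services cellsrv
cellcli -e "alter cell flashCacheMode=WriteBack"
cellcli -e alter cell startup services cellsrv
cellcli -e create flashcache all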

Writes to datafiles generally happen as a background task in Oracle, and most of the time we don’t actually “wait” on these IOs. That being the case, what advantage can we expect if these writes are optimized? To understand the possible advantages of the write-back cache, let’s review the nature of datafile write IO in Oracle and the symptoms that occur when write IO becomes the bottleneck.

When a block in the buffer cache is modified, it is the responsibility of the database writer (DBWR) to write these “dirty” blocks to disk. The DBWR does this continuously and uses asynchronous IO processing, so generally sessions do not have to wait for the IO to occur – the only time sessions wait directly on write IO is when a redo log sync occurs following a COMMIT.

However, should all the buffers in the buffer cache become dirty then a session may wait when it wants to bring a block into the cache – resulting in a “free buffer wait”.

[Figure: free buffer waits arise when every buffer in the cache is dirty]

Free buffer waits can occur in update-intensive workloads when the IO bandwidth of the Oracle sessions reading into the cache exceeds the IO bandwidth of the database writer. Because the database writer uses asynchronous parallelized write IO, and because all processes concerned are accessing the same files, free buffer waits usually happen when the IO subsystem can service reads faster than it can service writes.
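A quick way to check whether a system is suffering from this is to look at the system-level wait statistics; a minimal example:

select event, total_waits, time_waited/100 as time_waited_secs,
       average_wait*10 as avg_wait_ms
from   v$system_event
where  event in ('free buffer waits', 'buffer busy waits', 'write complete waits');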

There exists just such an imbalance between read and write latency in Exadata X2 – the Exadata Smart Flash Cache accelerates reads by a factor of perhaps 4-10 times, while offering no comparable advantage for writes. As a result, a very busy Exadata X2 system could become bottlenecked on free buffer waits. The Exadata Smart Flash Cache write-back cache accelerates datafile writes as well as reads and therefore reduces the chance of free buffer wait bottlenecks.

The figure below illustrates the effectiveness of the write-back cache for workloads that encounter free buffer waits. The workload used to generate this data was heavily write-intensive with very little read IO overhead (all the necessary read data was in cache). As a result, it experienced a very high degree of free buffer waits and some associated buffer busy waits. Enabling the write-back cache completely eliminated the free buffer waits by effectively accelerating the write IO bandwidth of the database writer. As a result, throughput increased fourfold.

[Figure: workload throughput before and after enabling the write-back cache]

However, don’t be misled into thinking that the write-back cache will be a silver bullet for every workload. Workloads that are experiencing free buffer waits are likely to see this sort of performance gain. Workloads where the dominant waits are CPU, read IO, global cache co-ordination, log writes and so on are unlikely to see any substantial benefit from the write-back cache.


[1] 11.2.3.2.1 is recommended as the minimum version for this feature as it contains fixes to significant issues discovered in the initial release.

Redo on SSD: effect of redo size (Exadata)


Of all the claims I make about SSD for Oracle databases, the one that generates the most debate is that placing redo logs on SSD is not likely to be effective. I’ve published data to that effect – in particular, see Using SSD for redo on Exadata - pt 2 and 04 Evaluating the options for Exploiting SSD.

I get a lot of push back on these findings – often on theoretical grounds from flash vendors (“our SSDs use advanced caching and garbage collection that support high rates of sequential IO”) or from people who say that they’ve used flash for redo and it “worked fine”.

Unfortunately, every single test I do comparing the performance of redo on flash and HDD shows flash offering little or no advantage, and in some cases a clear disadvantage.

One argument for flash SSD that I’ve heard is that while flash might not have the advantage for the small transactions I use for testing, it would work better for “big” redo writes – such as those associated with LOB updates. The idea is that the overhead of garbage collection and free page pool processing is lower with big writes, since you don’t hit the same flash SSD pages in rapid succession as you would with smaller writes. On the other hand, a reader who knows more about flash than I do (flashdba.com) recently commented: “in foreground garbage collection a larger write will require more pages to be erased, so actually will suffer from even more performance issues.”

It’s taken me a while to get around to testing this, but I tried it on our Exadata X-2 recently with a test that generates a variable amount of redo and then commits. The relationship between the size of the redo and redo log sync time is shown below:

[Figure: redo log sync time vs redo entry size, SSD vs HDD]

 

I’m now putting on my flame retardant underwear in anticipation of some dispute over this data… but this suggests that while SSD and HDD (at least on Exadata) are at about parity for small writes, flash degrades much more steeply than HDD as the size of the redo entry increases. Regardless of whether the redo is on flash or HDD, there’s a break at the 1MB point, which corresponds to the log buffer flush threshold. When a redo entry is only slightly bigger than 1MB, the chances are high that some of it will have been flushed already – see Redo log sync time vs redo size for a discussion of this phenomenon.

The SSD redo files were on an ASM disk group carved out of the Exadata flash disks – see Configuring Exadata flash as grid disk for how I created these. The redo logs were created with a 4K blocksize as outlined in Using SSD for redo on Exadata - pt 2. The database was in NOARCHIVELOG mode, and smart flash logging was disabled. As far as I can determine, there was no other significant activity on the flash disks (the grid disks were supporting all the database tablespaces, so if anything the SSD had the advantage).
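For anyone wanting to reproduce a similar layout, creating a 4K-blocksize redo log group on a flash-based disk group looks roughly like this (the group number and disk group name are hypothetical; see the posts linked above for the full details):

-- sketch only: add a redo log group on a flash-based ASM disk group with a 4K blocksize
-- note: on flash devices that report 512-byte sectors this may be rejected unless the
-- hidden parameter "_disk_sector_size_override" is set to TRUE - check the relevant support notes first
alter database add logfile group 11 ('+FLASH_DG') size 1g blocksize 4096;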

Why are we seeing such a sharp drop-off in performance for the SSD as the redo write increases in size? Well, one explanation was given by flashdba in this comment thread. It has to do with understanding what happens when a write IO that modifies an existing block hits a flash SSD. I tried to communicate my limited understanding of this process in Fundamentals of Flash SSD Technology. Instead of erasing the existing page, the flash controller will pull a page off a “free list” of pages and mark the old page as invalid. Later on, the garbage collection routines will reorganize the data and free up invalid pages. In this case, it’s possible that no free pages were available because garbage collection fell behind during the write-intensive workload. The more blocks written by LGWR, the more SSD pages had to be erased during these un-optimized writes, and therefore the larger the redo log write the worse the performance of the SSD.

Any other theories and/or observations?  

I hope soon to have a Dell system with Dell Express Flash so I can repeat these tests on a non-Exadata system. The F20 cards used in my X-2 are not state of the art, so it’s possible that different results could be obtained with a more recent flash card, or with a less contrived workload.

However, yet again I’m gathering data that suggests that using flash for redo logs is not worthwhile.  I’d love to argue the point but even better than argument would be some hard data in either direction….

To DMA, Or Not To DMA


Exadata is a different system for a DBA to administer. Some tasks in this environment, such as running the exachk script, require root O/S privileges. This script can be run by the system administrator, and this will be the case if you are managing Exadata as a DBA. However, a new role has emerged relative to Exadata, that of the Database Machine Administrator, or DMA. Let’s look at what being a DMA really means.

In addition to the usual DBA skillset, the DMA must also be familiar with, and be able to understand, the following management and monitoring commands on the specified systems.

On the compute nodes (database nodes):

Linux: top , mpstat , vmstat , iostat , fdisk , ustat , sar , sysinfo
Exadata: dcli
ASM: asmcmd , asmca
Clusterware: crsctl , srvctl

On the storage servers/cells:

Linux: top , mpstat , vmstat , iostat , fdisk , ustat , sar , sysinfo
Cell management: cellcli , cellsrvstat
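For example, dcli (listed above) can fan a single CellCLI command out to every storage cell; a minimal sketch, assuming a group file named cell_group that lists the cell hostnames:

# run one CellCLI command on every cell listed in the cell_group file
dcli -g cell_group -l celladmin "cellcli -e list cell attributes name, status"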

Being a DMA also includes other areas of responsibility not associated with being a DBA. The following table summarizes the areas of responsibility for a DMA:

*******DMA Responsibilities*******

Skill                     Percent
System Administrator      15
Storage Administrator     0
Network Administrator     5
Database Administrator    60
Cell Administrator        20

The ‘Percent’ column indicates the percentage of the overall Exadata system requiring this knowledge, and as you can see, if you’ve been an 11g RAC administrator you already have 60 percent of the skillset required to be a DMA. The remaining skills necessary to be a DMA are not difficult to learn and master. The Cell Administrator commands you will need ( cellcli , dcli ) will increase your knowledge to 80 percent of the DMA skillset.

CellCLI is the command-line interface to monitor and manage the storage cells. There are three supplied logins to each storage cell: ‘root’, ‘cellmonitor’ and ‘celladmin’. As you can probably guess, ‘celladmin’ is the most powerful login that isn’t ‘root’ (the superuser in Linux and Unix). You can do almost anything to the storage cells, including startup and shutdown, with ‘celladmin’. The ‘cellmonitor’ user can generate reports and list attributes from the storage cells but has no authority to perform management tasks. The full list of available CellCLI commands is shown below:

CellCLI> help

 HELP [topic]
   Available Topics:
        ALTER
        ALTER ALERTHISTORY
        ALTER CELL
        ALTER CELLDISK
        ALTER GRIDDISK
        ALTER IBPORT
        ALTER IORMPLAN
        ALTER LUN
        ALTER PHYSICALDISK
        ALTER QUARANTINE
        ALTER THRESHOLD
        ASSIGN KEY
        CALIBRATE
        CREATE
        CREATE CELL
        CREATE CELLDISK
        CREATE FLASHCACHE
        CREATE FLASHLOG
        CREATE GRIDDISK
        CREATE KEY
        CREATE QUARANTINE
        CREATE THRESHOLD
        DESCRIBE
        DROP
        DROP ALERTHISTORY
        DROP CELL
        DROP CELLDISK
        DROP FLASHCACHE
        DROP FLASHLOG
        DROP GRIDDISK
        DROP QUARANTINE
        DROP THRESHOLD
        EXPORT CELLDISK
        IMPORT CELLDISK
        LIST
        LIST ACTIVEREQUEST
        LIST ALERTDEFINITION
        LIST ALERTHISTORY
        LIST CELL
        LIST CELLDISK
        LIST FLASHCACHE
        LIST FLASHCACHECONTENT
        LIST FLASHLOG
        LIST GRIDDISK
        LIST IBPORT
        LIST IORMPLAN
        LIST KEY
        LIST LUN
        LIST METRICCURRENT
        LIST METRICDEFINITION
        LIST METRICHISTORY
        LIST PHYSICALDISK
        LIST QUARANTINE
        LIST THRESHOLD
        SET
        SPOOL
        START

CellCLI>

All of the above commands are available to ‘celladmin’; only the LIST, DESCRIBE, SET and SPOOL commands are available to ‘cellmonitor’.
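As an illustration, a few typical read-only checks run as ‘cellmonitor’ might look like the session below (the cell name is hypothetical, and the attribute and metric choices are just examples):

# connect to a storage cell as the read-only user
$ ssh cellmonitor@exa01cel01
$ cellcli
CellCLI> LIST CELL DETAIL
CellCLI> LIST GRIDDISK ATTRIBUTES name, status, asmModeStatus
CellCLI> LIST METRICCURRENT CL_CPUT, CL_MEMUT
CellCLI> LIST ALERTHISTORY WHERE severity = 'critical'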

Networking commands that you may need are ifconfig , iwconfig , netstat , ping , traceroute , and tracepath . You may, at some time, also need ifup and ifdown , to bring up or bring down network interfaces, although using these commands will not be a regular occurrence. The following example shows how to bring up the eth0 interface.

# ifup eth0

It seems like a daunting task to become a DMA, but it really isn’t that difficult. It does require a slightly different mindset, as you are now looking at, and managing, the entire system rather than just the database. There will still be a need for a dedicated System Administrator and Network Administrator for your Exadata system because, as a DMA, you won’t be responsible for configuration of these resources, nor will you be responsible for patching and firmware upgrades. The DMA essentially assists these dedicated administrators by taking on the day-to-day tasks they would otherwise provide. Being a DMA is also more useful to you and to the enterprise, as the regular tasks for these areas can be performed by the person or persons who do most of the interaction with Exadata on a daily basis. Enterprises vary, however, and it may not be possible to assume the role of DMA where the division of duties is strictly outlined and enforced. It is good to know, though, that such a role exists and may be made available to you at some time in the future.


Can the Exadata Smart Flash Cache slow smart scans?


I’ve been doing some work on the Exadata Smart Flash Cache recently and came across a situation in which setting CELL_FLASH_CACHE to KEEP will significantly slow down smart scans on a table.

If we create a table with default settings, then the Exadata Smart Flash Cache (ESFC) will not be involved in smart scans, since by default only small IOs get cached.  If we want the ESFC to be involved, we need to set CELL_FLASH_CACHE to KEEP.  Of course, we don’t expect immediate improvements, since the next smart scan will need to populate the cache before subsequent scans can benefit.
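For reference, the storage clause change and a quick way to confirm it look like this (the table name is hypothetical):

-- mark a table's blocks as eligible for the flash cache KEEP pool
alter table sales storage (cell_flash_cache keep);

-- confirm the current setting
select table_name, cell_flash_cache from user_tables where table_name = 'SALES';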

HOWEVER, what I’m seeing in practice is that the next smart scan following an ALTER TABLE … STORAGE(CELL_FLASH_CACHE KEEP) is significantly degraded, while subsequent scans get a performance boost.  Here’s an example of what I observe:

[Figure: smart scan elapsed and cell IO times, default vs CELL_FLASH_CACHE KEEP]

 

The big increase in CELL IO time is due to an increase in the NUMBER of ‘cell smart table scan’ waits.  The wait stats for the first scan with the default setting looked like this:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  gc cr disk read                                 1        0.00          0.00
  cell single block physical read                 2        0.01          0.01
  row cache lock                                  2        0.00          0.00
  gc cr grant 2-way                               1        0.00          0.00
  SQL*Net message to client                    1021        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: KO - fast object checkpoint                2        0.00          0.00
  cell smart table scan                        9322        0.14          7.60
  SQL*Net message from client                  1021        0.00          0.02

For the first scan with KEEP cache it looked like this:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    1021        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: KO - fast object checkpoint                2        0.00          0.00
  cell smart table scan                       14904        1.21         33.37
  SQL*Net message from client                  1021        0.00          0.02

Looking at the raw trace file didn’t help – it just shows a bunch of lines like this, with only a small number (3 in this case) of unique cellhash values… I couldn’t see a pattern:

WAIT #… : nam='cell smart table scan' ela= 678 cellhash#=398250101 p2=0 p3=0 obj#=139207 tim= …

I’m at a loss to understand why there would be such a high penalty for the initial smart scan with the CELL_FLASH_CACHE KEEP setting.  You expect some overhead from constructing and storing the result set blocks in the cache, but an IO penalty of 200-300% seems way too high.  Has anybody seen anything like this, or got a clear explanation?

Test script is here, and formatted tkprof here

Using SSD for a temp tablespace on Exadata


I seem to be getting a lot of surprising performance results lately on our X-2 quarter rack Exadata system, which is good – the result you don’t expect is the one that teaches you something new.

This time, I was looking at using a temporary tablespace based on flash disks rather than spinning disks.  In the past, using Fusion IO PCI cards, I found that using flash for the temp tablespace was very effective in reducing the overhead of multi-pass sorts:

[Figure: multi-pass sort performance with temp tablespace on Fusion IO flash vs spinning disk]

See (http://guyharrison.squarespace.com/ssdguide/04-evaluating-the-options-for-exploiting-ssd.html)
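For reference, a flash-based temporary tablespace for tests like these can be created along the following lines, assuming an ASM disk group built on flash grid disks (the disk group and user names here are hypothetical):

-- sketch only: create a temp tablespace on a flash-based disk group and assign it to a user
create temporary tablespace temp_ssd
  tempfile '+FLASH_DG' size 32g
  extent management local uniform size 1m;

alter user test_user temporary tablespace temp_ssd;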

However, when I repeated these tests on Exadata, I got very disappointing results.  The SSD based temp tablespace actually led to marginally worse performance:

[Figure: Exadata sort performance with temp tablespace on SSD vs HDD]

Looking in depth at a particular point (the 500K SORT_AREA_SIZE point), we can see that although the SSD based temp tablespace has marginally better read times, it involves a significantly higher write overhead:

[Figure: read and write time breakdown at the 500K SORT_AREA_SIZE point]

I can understand the higher write overhead (at least partially) – it’s yet another case where sequential write operations to an SSD device have provided disappointing performance.  However, it’s strange to see such poor relative read performance.  How can a spinning disk serve blocks up at effectively the same latency as an SSD?

So I dumped all the direct path read waits from a 10046 trace and plotted them logarithmically:

[Figure: distribution of direct path read latencies (log scale), SSD vs HDD temp tablespace]

We can see in this chart that the SSD based tablespace suffers from a small “spike” of high latencies between 600-1000 us (i.e. 0.6-1 ms).  These are extremely high latencies for an SSD!  What could be causing them?  Garbage collection triggered by the almost constant writes to the temp tablespace?  There was negligible concurrent activity on the system and the table concerned had flash cache disabled, so for now that is my #1 theory.

For that matter, why are the HDD read times so low?  An average read latency of 500 us for a spinning disk is unreasonably low – is the storage cell somehow buffering temporary tablespace IO?

As always I’m wondering if there’s someone with more expertise in Exadata internals who could shed some light on all of this!

Troubleshooting Oracle DBFS mount issues


On Exadata the local drives on the compute nodes are not big enough to accommodate larger exports, so DBFS is often configured. In my case I had a 1.2 TB DBFS file system mounted under /dbfs_direct.

While I was doing some exports yesterday I found that my DBFS wasn’t mounted, and a quick crsctl command to bring it online failed:

[oracle@exadb01 ~]$ crsctl start resource dbfs_mount -n exadb01
 CRS-2672: Attempting to start 'dbfs_mount' on 'exadb01'
 CRS-2674: Start of 'dbfs_mount' on 'exadb01' failed
 CRS-2679: Attempting to clean 'dbfs_mount' on 'exadb01'
 CRS-2681: Clean of 'dbfs_mount' on 'exadb01' succeeded
 CRS-4000: Command Start failed, or completed with errors.

It doesn’t give you any error message or reason why it’s failing, and neither do the other database and grid infrastructure logs. The only useful approach is to enable tracing for the dbfs client and see what’s happening. To enable tracing, edit the mount script and insert the following MOUNT_OPTIONS:

vi $GI_HOME/crs/script/mount-dbfs.sh
MOUNT_OPTIONS=trace_level=1,trace_file=/tmp/dbfs_client_trace.$$.log,trace_size=100

Now start the resource one more time to get the log file generated. You can also get the same tracing from the command line with the client directly:

[oracle@exadb01 ~]$ dbfs_client dbfs_user@ -o allow_other,direct_io,trace_level=1,trace_file=/tmp/dbfs_client_trace.$$.log /dbfs_direct
Password:
Fail to connect to database server.

 

After checking the log file it’s now clear why DBFS was failing to mount – the DBFS database user’s password had expired:

tail /tmp/dbfs_client_trace.100641.log.0
 [43b6c940 03/12/14 11:15:01.577723 LcdfDBPool.cpp:189         ] ERROR: Failed to create session pool ret:-1
 [43b6c940 03/12/14 11:15:01.577753 LcdfDBPool.cpp:399         ] ERROR: ERROR 28001 - ORA-28001: the password has expired

[43b6c940 03/12/14 11:15:01.577766 LcdfDBPool.cpp:251         ] DEBUG: Clean up OCI session pool...
 [43b6c940 03/12/14 11:15:01.577805 LcdfDBPool.cpp:399         ] ERROR: ERROR 24416 - ORA-24416: Invalid session Poolname was specified.

[43b6c940 03/12/14 11:15:01.577844 LcdfDBPool.cpp:444         ] CRIT : Fail to set up database connection.

 

The account had the default profile, which has a default PASSWORD_LIFE_TIME of 180 days:

SQL> select username, account_status, expiry_date, profile from dba_users where username='DBFS_USER';

USERNAME                       ACCOUNT_STATUS                   EXPIRY_DATE       PROFILE
------------------------------ -------------------------------- ----------------- ------------------------------
DBFS_USER                      EXPIRED                          03-03-14 14:56:12 DEFAULT

Elapsed: 00:00:00.02
SQL> select password from sys.user$ where name= 'DBFS_USER';

PASSWORD
------------------------------
A4BC1A17F4AAA278

Elapsed: 00:00:00.00
SQL> alter user DBFS_USER identified by values 'A4BC1A17F4AAA278';

User altered.

Elapsed: 00:00:00.03
SQL> select username, account_status, expiry_date, profile from dba_users where username='DBFS_USER';

USERNAME                       ACCOUNT_STATUS                   EXPIRY_DATE       PROFILE
------------------------------ -------------------------------- ----------------- ------------------------------
DBFS_USER                      OPEN                             09-09-14 11:09:43 DEFAULT


SQL> select * from dba_profiles where resource_name = 'PASSWORD_LIFE_TIME';

PROFILE                        RESOURCE_NAME                    RESOURCE LIMIT
------------------------------ -------------------------------- -------- ----------------------------------------
DEFAULT                        PASSWORD_LIFE_TIME               PASSWORD 180

 

After resetting the database user’s password, DBFS mounted successfully!

If you are using a dedicated database for DBFS, make sure you set PASSWORD_LIFE_TIME to UNLIMITED to avoid similar issues.
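A minimal sketch of that change, assuming the DBFS schema is DBFS_USER as in the example above:

-- give the DBFS schema its own profile with a non-expiring password
create profile dbfs_profile limit password_life_time unlimited;
alter user dbfs_user profile dbfs_profile;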

 

 


Migrating a Database to an Exadata Machine


We have been migrating our databases from non-Exadata servers to the Exadata Database Machine using the RMAN 11g “duplicate … for standby from active database” command to create the standby databases on the Exadata machine. Below are the steps which were performed for these successful migrations. Assumptions: here we assume that the following tasks have been […]

The post Migrating a Database to an Exadata Machine appeared first on VitalSoftTech.

ASM Diskgroup shows USABLE_FILE_MB value in Negative

Today while working on an ASM diskgroup I noticed a negative value for USABLE_FILE_MB. I was a little surprised, as it has been quite a while since I last worked on ASM. So I started looking around for blogs and MOS docs and found a few really nice ones. A negative value for USABLE_FILE_MB means that you do not have […]

Oracle Enterprise Manager–Based Patching


Friends,

Hope those of you who could make it are enjoying Oracle Open World (OOW).

An interesting OOW session my colleague Hari Srinivasan pointed out for Wednesday Oct 1 this week: 

Databases to Oracle Exadata: The Saga Continues for Oracle Enterprise Manager–Based Patching

    Brian Bong, Director, Database & Analytics Architecture, Walgreens Corp
    Dee Hicks, Manager, Database Management, Deloitte Consulting LLP
    Hari Srinivasan, Consulting Product Manager, Oracle

10:15 AM - 11:00 AM Moscone South - 300 CON8121

The link for this session is https://oracleus.activeevents.com/2014/connect/sessionDetail.ww?SESSION_ID=8121 

Regards,

Porus.

How Can I Compress Thee


“You can swim all day in the Sea of Knowledge and not get wet.” 
― Norton Juster, The Phantom Tollbooth 

In previous posts compression options have been discussed, and now it’s time to see how Oracle performs basic compression. It isn’t really compression, it’s de-duplication, but it does result in space savings for data that won’t be modified after it’s ‘compressed’. Let’s look at how Oracle saves space with your data.

Oracle de-duplicates the data by finding common strings, tokenizing them and using the token identifier in the string to reduce the row length. So, what does that mean? Looking at an example might help; a table is built and populated as follows:


--
-- Create and populate the table
--
create table comptst(
	tstcol1	varchar2(4),
	tstcol2 varchar2(6),
	tstcol3	number(8,2),
	tstcol4	varchar2(10));

insert into comptst
values ('ZZXZ', 'bbddff', 199.44, 'PENDING');

insert into comptst
values ('ZZXZ', 'ghijkl', 43.08, 'PENDING');

insert into comptst
values ('ZZXZ', 'bbddff', 881.02, 'PENDING');

insert into comptst
values ('ZZXZ', 'bbddff', 54.97, 'PENDING');

commit;

insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;

commit;
--
-- Compress the table with BASIC compression
--
alter table comptst compress basic;
alter table comptst move;
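The block dumps discussed below can be generated along these lines; a minimal sketch (the file and block numbers returned by the first query are substituted into the second statement):

-- locate a block belonging to the table...
select dbms_rowid.rowid_relative_fno(rowid) as file_no,
       dbms_rowid.rowid_block_number(rowid) as block_no
from   comptst
where  rownum = 1;

-- ...then dump it to a trace file (plug in the values returned above)
alter system dump datafile <file_no> block <block_no>;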

[The table was compressed after the data was inserted, which required two steps: the first to set the compression level and the second, a table move in place, to actually compress the data. Had the table been built as compressed and direct path inserts used, the data would have been compressed without further action.] Since the initial 4-row insert was re-inserted multiple times there is a lot of duplication in the data, and since Oracle de-duplicates rows to produce the effect of compression there should be a lot of data in a block dump indicating this. There is, and the first piece of that data is the following line:


  perm_9ir2[4]={ 0 2 3 1 }

Oracle builds a token table for each data block; this provides a reference for each data string that occurs more than once in the block. Additionally, Oracle can re-arrange the column values in that token table so that multiple column values can be turned into a single token and, thus, a single reference. The line shown above indicates which column values map to which positions in the token table for this block; in this case column 0 maps to the data in table column 0, column 1 maps to the data in table column 2, column 2 maps to data column 3 and column 3 maps to data column 1. Let’s look at the unique data that was inserted:


('ZZXZ', 'bbddff', 199.44, 'PENDING');
('ZZXZ', 'ghijkl', 43.08, 'PENDING');
('ZZXZ', 'bbddff', 881.02, 'PENDING');
('ZZXZ', 'bbddff', 54.97, 'PENDING');

Since these data rows are duplicated in each block, every column is a potential compression token. Two values occur in every row, ‘ZZXZ’ and ‘PENDING’, so it should be expected that tokens for those values will be found in each of the compressed data rows. As mentioned previously, Oracle builds a token table in each block, so there are two tables in this block: the first, starting at offset 0, is the token table, which has 7 rows; the second, starting at offset 7, is the actual table data, which has 721 rows:


0x24:pti[0]	nrow=7		offs=0
0x28:pti[1]	nrow=721	offs=7

Oracle is clever with this implementation of compression and can create a token that includes a data value and another token, from the same token table, to reduce the row length even further. The examples provided here won’t demonstrate that, but know that it is possible. Now let’s look at the first row in this block for the data table:


tab 1, row 0, @0x1f31
tl: 5 fb: --H-FL-- lb: 0x0  cc: 4
col  0: [ 4]  5a 5a 58 5a
col  1: [ 7]  50 45 4e 44 49 4e 47
col  2: [ 6]  62 62 64 64 66 66
col  3: [ 3]  c1 37 62
bindmp: 2c 00 01 04 02

The actual column lengths are supplied between the square brackets for each column; the total length should be the sum of those values plus 7 bytes, 4 of those for the column lengths, one for the lock byte, one for the flag byte and one for the column count. Using that information the total length should be 24 bytes; the block dump provides a different total length of 5, as reported by the tl entry. There is a line at the end of the row dump labeled bindmp (a binary dump of the row contents) revealing the actual contents of those 5 bytes. As expected there is the lock byte (0x2c), the number of columns at this location (0x01) and two bytes representing the token, reporting that 4 columns are in this token and that the reference row in the token table is row 2. So, let’s look at table 0, row 2:


tab 0, row 2, @0x1f5c
tl: 10 fb: --H-FL-- lb: 0x0  cc: 4
col  0: [ 4]  5a 5a 58 5a
col  1: [ 7]  50 45 4e 44 49 4e 47
col  2: [ 6]  62 62 64 64 66 66
col  3: [ 3]  c1 37 62
bindmp: 00 b3 04 04 05 06 cb c1 37 62

It looks almost like the data row, but the total token length is 10 bytes. Looking at the bindmp the first two bytes indicate this token is used 179 times in this block, the third byte indicates that 4 columns are in this token, the two bytes after that report that the first two columns are also tokens, 0x04 and 0x05. Going back to the token table we see that those tokens are:


tab 0, row 4, @0x1f66
tl: 7 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 4]  5a 5a 58 5a
bindmp: 00 04 cc 5a 5a 58 5a
tab 0, row 5, @0x1f76
tl: 10 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 7]  50 45 4e 44 49 4e 47
bindmp: 00 04 cf 50 45 4e 44 49 4e 47

These are single-column tokens, and each is used 4 times in this block. This is how Oracle reduced the row length from 24 bytes to 5 to save block space. Working through the block dump it’s now possible to re-construct the 24 bytes of data the row originally contained even though it now is only 5 bytes in length.

We see that Oracle doesn’t actually compress data; it replaces duplicate values with tokens and, through those tokens, reconstructs the data at query time by using the row directory and the actual row pieces in each block. Depending on the select list, some tokens won’t be accessed if that data isn’t required. Of course, all of this reconstruction can be expensive at the CPU level, and for full table scans of large tables performance can be an issue, especially with the “cache buffers chains” latch, because Oracle is performing fewer “consistent gets – examination”; Oracle has to pin blocks for a longer period due to the reconstruction. On the plus side, the number of physical reads can decrease, since the data is in a smaller number of blocks and can stay in the cache longer.

Using basic compression is a trade-off between size and performance, and for extremely large tables, or in cases where the compression savings are quite large (meaning the data is compressed more), queries may become CPU-intensive rather than I/O-intensive. The good and the bad need to be weighed carefully when making the decision to use compression; choose wisely. Space is relatively inexpensive when compared to end-user satisfaction, and the DBA’s idea of performance and the end-users’ idea of performance use different criteria – it’s really the end-users’ idea that should take precedence.

Anyone up for a swim?


Enhancements to the Oracle Multitenant option in Oracle Database 12.1.0.2 – Part I


Oracle Database 12.1.0.2 delivers several noteworthy enhancements to the Oracle Multitenant Option of the Oracle Database. We will begin to take a look at these enhancements in our blog posts.

In Oracle Database 12.1.0.2 there is a new and powerful feature which is quite useful: remote cloning of pluggable databases is now fully functional. It uses database links between the container databases.

The remote cloning capability brings considerable data mobility benefits when using the Multitenant option. PDBs can now be relocated on demand when using a hybrid cloud model (e.g. the scenario where your production databases are set up on-premise, and your development and test databases are set up in the public cloud).

The remote cloning capability includes snapshot cloning. Snapshot cloning is supported on any dNFS file system. Note that the CLONEDB database initialization parameter must be set to TRUE for snapshot cloning to work on a dNFS system. If CLONEDB is set to FALSE, then snapshot cloning will only work when the file system supports storage snapshots, as in the case of Oracle Automatic Storage Management Cluster File System (Oracle ACFS – now called Oracle CloudFS) and Direct NFS Client storage (e.g. Sun ZFS Storage Appliance, NetApp, EMC).

A non-CDB can also be cloned across the database link, but snapshot cloning is not possible in this case.

The command used is “create pluggable database <pdb name>” with the clone construct. This construct looks like “from <source pdb / non-CDB>@dblink”, with or without the “snapshot copy” clause. For example, in the container database CDB1, issue the following command to clone from the SALES PDB in the remote CDB2:

create pluggable database SALESTEST from SALES@CDB2;

There is also a NO DATA clause that can be used when cloning PDBs (but not non-CDBs). This is also an enhancement in 12.1.0.2 and allows you to make a metadata-only clone. The clone is created with the same data model as the source, but no data. This would be useful in quickly creating development environments with the same schema as the production PDB but without the production data.
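To illustrate, the two variations look like this (PDB names other than SALES are hypothetical; the snapshot copy variant assumes the dNFS/CLONEDB or snapshot-capable storage prerequisites described above):

-- metadata-only clone across the database link
create pluggable database SALESDEV from SALES@CDB2 no data;

-- snapshot copy clone
create pluggable database SALESSNAP from SALES@CDB2 snapshot copy;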

One restriction of using dNFS and CLONEDB is that the source PDB must be read-only. This means you cannot take snapshots directly from a production database for development and testing, but rather from a copy of the production database that has been fully masked of its confidential data content, which is the correct thing to do anyway. Note that as long as the clones exist, the source PDB must remain open in read-only mode.

If the database uses ASM, then snapshot cloning is supported only on Exadata, and not on ASM without Exadata, at this point in time.

The full syntax of the cloning is explained in the documentation here, along with other restrictions.

Exadata 12c New Features RMOUG Slides

I've finally gotten around to posting my RMOUG slide deck on Slideshare. Hopefully this is helpful to folks looking at new features in Exadata.

How to configure Link Aggregation Control Protocol on Exadata


During a recent X5 installation I had to configure Link Aggregation Control Protocol (LACP) on the client network of the compute nodes. Although the ports were running at 10Gbit/s and the default active/passive configuration works perfectly fine, the customer wanted an even distribution of traffic and workload across their core switches.

Link Aggregation Control Protocol (LACP), also known as 802.3ad, is a method of combining multiple physical network connections into one logical connection to increase throughput and provide redundancy should one of the links fail. The protocol requires both the server and the switch(es) to have the same settings for LACP to work properly.

To configure LACP on Exadata you need to change the bondeth0 parameters.

On each of the compute nodes open the following file:

/etc/sysconfig/network-scripts/ifcfg-bondeth0

and replace the line saying BONDING_OPTS with this one:

BONDING_OPTS="mode=802.3ad xmit_hash_policy=layer3+4 miimon=100 downdelay=200 updelay=5000 num_grat_arp=100"

and then restart the network interface:

ifdown bondeth0
ifup bondeth0
Determining if ip address 192.168.1.10 is already in use for device bondeth0...

You can check the status of the interface by querying the proc filesystem. Make sure both interfaces are up and running at the same speed. The essential part confirming that LACP is working is shown below:

cat /proc/net/bonding/bondeth0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 33
Partner Key: 34627
Partner Mac Address: 00:23:04:ee:be:c8
I had a problem where the client network did NOT come up after a server reboot. This happened because during system boot the 10Gbit interfaces go through multiple resets, causing very fast link state changes. Here is the status of the bond at that time:
cat /proc/net/bonding/bondeth0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
bond bondeth0 has no active aggregator
The solution was to decrease downdelay to 200. The issue is described in this note:
Bonding Mode 802.3ad Using 10Gbps Network – Slave NICs Fail to Come Up Consistently after Reboot (Doc ID 1621754.1)

 


applyElasticConfig.sh fails with Unable to locate any IB switches

$
0
0

With the release of Exadata X5, Oracle introduced elastic configurations and changed the way the initial configuration is performed. Previously you would run applyconfig.sh, which would go across the nodes and change all the settings according to your config. This script has now evolved and is called applyElasticConfig.sh, which is part of OEDA (onecommand). During one of my recent deployments I ran into the problem below:

[root@node8 linux-x64]# ./applyElasticConfig.sh -cf Customer-exa01.xml

Applying Elastic Config...
Applying Elastic configuration...
Searching Subnet 172.16.2.x..........
5 live IPs in 172.16.2.x.............
Exadata node found 172.16.2.46.
Collecting diagnostics...
Errors occurred. Send /opt/oracle.SupportTools/onecommand/linux-x64/WorkDir/Diag-150512_160716.zip to Oracle to receive assistance.
Exception in thread "main" java.lang.NullPointerException
at oracle.onecommand.commandexec.utils.CommonUtils.getStackFromException(CommonUtils.java:1579)
at oracle.onecommand.deploy.cliXml.ApplyElasticConfig.doDaApply(ApplyElasticConfig.java:105)
at oracle.onecommand.deploy.cliXml.ApplyElasticConfig.main(ApplyElasticConfig.java:48)

Going through the logs we can see the following message:

2015-05-12 16:07:16,404 [FINE ][ main][ OcmdException:139] OcmdException from node node8.my.company.com return code = 2 output string: Unable to locate any IB switches... stack trace = java.lang.Throwable

The problem was caused by the IB switch names in my OEDA XML file being different from the ones actually in the rack – in fact, the IB switch hostnames were missing from the switches’ hosts files. So if you ever run into this problem, make sure the IB switch hosts file (/etc/hosts) has the correct hostname in the proper format:

#IP                 FQDN                      ALIAS
192.168.1.100       exa01ib01.local.net       exa01ib01

Also make sure to reboot the IB switch after any change of the hosts file.

How to configure Power Distribution Units on Exadata X5


I’ve done several Exadata deployments recently and I have to say that, of all the components, the PDUs were the hardest to configure. It is important to note that, unlike earlier generations of Exadata, the PDUs in X5 are Enhanced PDUs, not Standard.

The public documentation (Configuring the Power Distribution Units) says that on PDUs with three power input leads you need to connect the middle power lead to the power source. Well, I’ve done that many times and it did NOT work; the documentation says that the PDU should then be accessible on 192.168.0.1. I believe the reason it isn’t is that DHCP is enabled by default, which can easily be confirmed by checking the LCD screen of the PDU. I even tried setting up a DHCP server myself to make the PDU acquire an IP address, but that didn’t work either.

To configure the PDU you need to connect through the serial management port. Nowadays there are no laptops with serial ports, so you will need a USB to RS-232 DB9 serial adapter – I bought mine from Amazon. You will also need a DB9 to RJ45 cable – these are quite common and I’m sure you’ve seen the blue Cisco console cable before.

You need to connect the cable to the SET MGT port of the PDU and then establish a terminal connection (you can use PuTTY too) with the following settings:
9600 baud, 8 data bits, 1 stop bit, no parity bit, no flow control

The username is admin and the password is adm1n.

Here are the commands you need to configure the PDU. Each network change requires a reboot of the PDU:

Welcome to Oracle PDU

pducli->username: admin
pducli->password: *****
Login OK - Admin rights!
pducli->

set pdu_name=exa01pdu01
set systime_manual_date=2015-06-18
set systime_manual_time=12:45:00
set systime_ntp_server_enable=On
set systime_ntp_server=192.168.1.2
set systime_dst_enable=On
set net_ipv4_dhcp=Off
reset=yes

set net_ipv4_ipaddr=192.168.1.10
set net_ipv4_subnet=255.255.255.0
set net_ipv4_gateway=192.168.1.1
set net_ipv4_dns1=192.168.1.3
set net_ipv4_dns2=192.168.1.4
reset=yes

Regarding the network connectivity – the documentation says you need two additional cables from your management network. However, if you have a half or quarter rack you can plug the PDU network connections into the management (Cisco) switch. Note that if you ever plan to upgrade to a full rack you will have to provide the two additional cables from your management network and disconnect the PDUs from the management switch.

 

IMPORTANT: Make sure you don’t leave any active CLI sessions open, otherwise you won’t be able to log in remotely and a data centre visit will be required to reboot the PDU.

How do I change DNS servers on Exadata storage servers


This is just a quick post to highlight a problem I had recently on another Exadata deployment.

For most customers the management network on Exadata is routable and the DNS servers are accessible. However, in a recent deployment for a financial organization this wasn’t the case and the storage servers were NOT able to reach the DNS servers. The customer provided a different set of DNS servers within the management network which were still able to resolve all the Exadata hostnames. If you encounter a similar problem, stop all cell services and run ipconf on each storage server to update the DNS servers.
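As a rough sketch of that sequence on one cell (the ipconf path and prompts vary by storage software version, so check the MOS note referenced below before relying on this):

# on each affected storage server, as root
cellcli -e alter cell shutdown services all
/opt/oracle.cellos/ipconf          # interactive; supply the new DNS servers when prompted
cellcli -e alter cell startup services all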

On each storage server there is a service called cellwall (/etc/init.d/cellwall) which runs many checks and applies a lot of iptables rules. Here are a couple of comments from the script to give you an idea:

# general lockdown from everything external, (then selectively permit)
  # general permissiveness (localhost: if you are in, you are in)
  # allow all udp traffic only from rdbms hosts on IPoIB only
      # allow DNS to work on all interfaces
      # open sport=53 only for DNS servers (mitigate remote-offlabel-port exploit)

and many more, but you can check the script to see what it does, or run iptables -L -n to get all the iptables rules.

Here is some more information on how to change IP addresses on Exadata:
Changing IP addresses on Exadata Database Machine (Doc ID 1317159.1)

UPDATE: Thanks to Jason Arneil for pointing out the proper way to update the configuration of the cell.

dbnodeupdate.sh post upgrade step fails on Exadata storage software 12.1.2.1.1


I’ve done several Exadata deployments in the past two months and had to upgrade the Exadata storage software on half of them. The reason was that units shipped before May came with Exadata storage software version 12.1.2.1.0.

The upgrade process on the database nodes ran fine, but when I ran dbnodeupdate.sh -c to complete the post-upgrade steps I got an error that the system wasn’t on the expected Exadata release or kernel:

(*) 2015-06-01 14:21:21: Verifying GI and DB's are shutdown
(*) 2015-06-01 14:21:22: Verifying firmware updates/validations. Maximum wait time: 60 minutes.
(*) 2015-06-01 14:21:22: If the node reboots during this firmware update/validation, re-run './dbnodeupdate.sh -c' after the node restarts..
(*) 2015-06-01 14:21:23: Collecting console history for diag purposes

ERROR: System not on expected Exadata release or kernel, exiting


ERROR: Correct error, or to override run: ./dbnodeupdate.sh -c -q -t 12.1.2.1.1.150316.2

Indeed, the database node was running the new Exadata software but still using the old kernel (2.6.39-400.243), while dbnodeupdate was expecting the new 2.6.39-400.248 kernel:

imageinfo:
Kernel version: 2.6.39-400.243.1.el6uek.x86_64 #1 SMP Wed Nov 26 09:15:35 PST 2014 x86_64
Image version: 12.1.2.1.1.150316.2
Image activated: 2015-06-01 12:27:57 +0100
Image status: success
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

The reason was that the previous run of dbnodeupdate installed the new kernel package but failed to update grub.conf. The solution is to manually add the missing kernel entry to grub.conf and reboot the server to pick up the new kernel. Here is a note with more information, which was still internal at the time I hit this problem:

Dbnodeupdate.sh Finishes With Error: System Not On Expected Exadata Release Or Kernel, Exiting (Doc ID 2007282.1)
Bug 20708183 – DOMU:GRUB.CONF KERNEL NOT ALWAYS UPDATED GOING TO 121211, NEW KERNEL NOT BOOTED
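For illustration only, the stanza added to grub.conf might look roughly like the one below; in practice, copy the existing 2.6.39-400.243 stanza and change only the kernel/initrd version strings, keeping the original boot arguments (the version suffix and arguments shown here are placeholders):

# /boot/grub/grub.conf -- hypothetical entry modelled on the existing stanza
title Exadata DB node (2.6.39-400.248.x.el6uek.x86_64)
        kernel /vmlinuz-2.6.39-400.248.x.el6uek.x86_64 ro root=/dev/mapper/VGExaDb-LVDbSys1 <existing boot arguments>
        initrd /initramfs-2.6.39-400.248.x.el6uek.x86_64.img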

 

 

MGMTDB not automatically created on Exadata X5 and GI 12.1.0.2


While deploying an X5 Full Rack recently, it happened that the Grid Infrastructure Management Repository (GIMR) was not created by onecommand. The GIMR database was optional in 12.1.0.1, became mandatory in 12.1.0.2, and should be installed automatically with Oracle Grid Infrastructure 12c Release 1 (12.1.0.2). For a reason unknown to me that didn’t happen, and I had to create it manually. I checked all the log files but couldn’t find any errors. For reference, the OEDA version used was Feb 2015 v15.050 and the image version on the Exadata was 12.1.2.1.0.141206.1.

To create the database, log in as the grid user and create a file holding the following variables:

cat > /tmp/cfgrsp.properties
oracle.assistants.asm|S_ASMPASSWORD=[your ASM password]
oracle.assistants.asm|S_ASMMONITORPASSWORD=[your ASM password]

and run the following command:

GRID_HOME=/u01/app/12.1.0.2/grid
[oracle@exa01 ~]$ $GRID_HOME/cfgtoollogs/configToolAllCommands RESPONSE_FILE=/tmp/cfgrsp.properties

For reference, here is similar bug I found on MOS:
-MGMTDB Not Created When Using EM12c Provisioning (Doc ID 1983885.1)
