OraFAQ Forum: SQL & PL/SQL » Large volume data compare

Home » SQL & PL/SQL » SQL & PL/SQL » Large volume data compare - NEED HELP (Oracle 11)

Show: Today's Messages :: Polls :: Message Navigator
E-mail to friend

Large volume data compare - NEED HELP [message #665643]

Wed, 13 September 2017 22:06

Member2014
Messages: 33
Registered: July 2014
Location: USA

Member

Hello,

I need help with a solution we are trying to build. We have a process which drops and recreates 2 large tables (each contains about 200MM records and about 800 columns) daily.
I need to extract records from each of the tables daily after they are built if a subset of column values for the primary key has changed. The two issues I have is:
1. Since its a drop and recreate, we currently do not have daily backups of the table to compare
2. its a very large volume and so not sure the best way to compare data

Below is a sample of one of the tables.

table name: 
create table name_address
(id varchar2(100),
fname varchar2(100),
lname varchar2(200),
address1 varchar2(500),
address2 varchar2(500),
city varchar2(50),
state varchar2(50),
zip varchar2(50));

Extract layout is as below. We need to pull records in the file for any ID's when there is a change in city or state or zip.
The column action indicates I - if its a new record U - if any of the values indicated above have been updated

id, fname, lname, city, state, zip, action

I will need to do the following -

1. After the extract for a given day is completed, create a table daily with the subset of columns needed for comparison say prev_name_address
2. Next day once the table has been built compare the two tables prev_name_address and name_address against specific columns (city/ state/ zip) to determine the new/ updated records

My problem areas -

1. What is the best way to back up a subset of data for a table which is so large (~200MM records with ~700 columns)
2. What is the best way to compare the backup vs production table to identify the inserts/updates based on the city/ state/ zip value changes

Any help is greatly appreciated. Thank you in advance.

Report message to a moderator

Large volume data compare - NEED HELP [message #665644 is a reply to message #665643]

Wed, 13 September 2017 22:06

Member2014
Messages: 33
Registered: July 2014
Location: USA

Member

table name: 
create table name_address
(id varchar2(100),
fname varchar2(100),
lname varchar2(200),
address1 varchar2(500),
address2 varchar2(500),
city varchar2(50),
state varchar2(50),
zip varchar2(50));

id, fname, lname, city, state, zip, action

I will need to do the following -

1. After the extract for a given day is completed, create a table daily with the subset of columns needed for comparison say prev_name_address
2. Next day once the table has been built compare the two tables prev_name_address and name_address against specific columns (city/ state/ zip) to determine the new/ updated records

My problem areas -

1. What is the best way to back up a subset of data for a table which is so large (~200MM records with ~700 columns)
2. What is the best way to compare the backup vs production table to identify the inserts/updates based on the city/ state/ zip value changes

Any help is greatly appreciated. Thank you in advance.

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665646 is a reply to message #665644]

Wed, 13 September 2017 22:28

BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal

Senior Member

I may just be old & senile but I could benefit from clarifications.
BTW - tables don't contain any records; only rows.
1) in integer value what does 200MM equal?
2) on "drop & recreate" what is the data source for the data values. (Why not TRUNCATE instead?)
3) You say that 2 large tables are involved, but I am unclear what the difference is between the 2 tables. (Why 2 tables?)
4) AFAIK SQL does not have any "extract" statement; so what exactly needs to be done.
5) I am unclear what is being compared to what?

A wise man once gave me the following advice.
First make it work, then make it fancy or fast.

If you post all the necessary SQL that successfully completes the task, I am sure you will get ideas on how to make it faster!

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665647 is a reply to message #665646]

Wed, 13 September 2017 22:56

Member2014
Messages: 33
Registered: July 2014
Location: USA

Member

Apologize if its a little unclear. I'm still trying to figure out how to implement but hope this helps -

1) in integer value what does 200MM equal?
200,000,000 i.e. 200 million
2) on "drop & recreate" what is the data source for the data values. (Why not TRUNCATE instead?)
We have multiple data sources used to build this table and a series of logic involved. The process i inherited had been built to drop an create. I'm not sure what the though process was when this was built. Definately an optimization that can be looked into
3) You say that 2 large tables are involved, but I am unclear what the difference is between the 2 tables. (Why 2 tables?)
One table contains name/address data and the other email. For the purpose of this example I have given a sample of one of the tables. Once I have the logic for one of the tables it should be a simple replication for the other
4) AFAIK SQL does not have any "extract" statement; so what exactly needs to be done.
5) I am unclear what is being compared to what?
I have put in code for a simple way to generate the extract. This will generate the result but probably not the most optimized way.

1. Back up table with necessary columns to compare (day 1)

CREATE TABLE PREV_NAME_ADDRESS PARALLEL 4
AS SELECT id, fname, lname, city, state, zip FROM NAME_ADDRESS;

2. Once the NAME_ADDRESS table is built (day 2) compare the table with PREV_NAME_ADDRESS on a subset of columns i.e. city/ state/ zip

CREATE TABLE NAME_ADDRESS_EXTRACT PARALLEL 4 
AS SELECT id, fname, lname, city, state, zip,
	CASE WHEN n.id IS NOT NULL AND o.id IS NULL THEN 'I' ELSE 'U' END IF AS action
FROM NAME_ADDRESS n 
LEFT OUTER JOIN PREV_NAME_ADDRESS o
ON n.id = o.id
WHERE n.city <> o.city
AND n.state <> o.state
AND n.zip <> o.zip;

3. Generate an extract of the NAME_ADDRESS_EXTRACT table once its built to a flat file

 SELECT * FROM NAME_ADDRESS_EXTRACT;

The logic for populating the "action" column I need to test.

Please note that the name_address table I have built here is just a subset of the table we have on the DB. My biggest concern is the comparison of the columns between name_address and prev_name_address.
Here I have shown 3 columns to be compared out of 8 columns. In my DB table I need to compare 18 columns out of about 700 columns.

Is there a better way like adding indices for faster comparison, hash function, or something else which will make the comparison more efficient and faster?

Thank you again for looking into it.

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665648 is a reply to message #665647]

Wed, 13 September 2017 23:14

BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal

Senior Member

I admit I am just dense.
What is being compared to what?
What constitute a match?
What is a mis-match?
Which is more important & what should be done upon detection?

Please stop obsessing on "efficiency"!
If you can't do basic detection, you have nothing to "make faster".

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665656 is a reply to message #665648]

Thu, 14 September 2017 03:37

cookiemonster
Messages: 13943
Registered: September 2008
Location: Rainy Manchester

Senior Member

It'd probably help if you described the business process you're trying to implement, rather than what this particular implementation is trying to do.
There may well be easier/ more efficient ways than creating tables on the fly every day.

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665657 is a reply to message #665648]

Thu, 14 September 2017 03:44

quirks
Messages: 82
Registered: October 2014

Member

CREATE TABLE NAME_ADDRESS(
    ID VARCHAR2(100)
   ,FNAME VARCHAR2(100)
   ,LNAME VARCHAR2(200)
   ,ADDRESS1 VARCHAR2(500)
   ,ADDRESS2 VARCHAR2(500)
   ,CITY VARCHAR2(50)
   ,STATE VARCHAR2(50)
   ,ZIP VARCHAR2(50)
);

INSERT INTO NAME_ADDRESS
     VALUES (
                '01'
               ,'01 first name'
               ,'01 last_name'
               ,'01 address1'
               ,'01 address2'
               ,'01 city'
               ,'01 state'
               ,'01 zip'
            );

INSERT INTO NAME_ADDRESS
     VALUES (
                '02'
               ,'02 first name'
               ,'02 last_name'
               ,'02 address1'
               ,'02 address2'
               ,'02 city'
               ,'02 state'
               ,'02 zip'
            );

INSERT INTO NAME_ADDRESS
     VALUES (
                '03'
               ,'03 first name'
               ,'03 last_name'
               ,'03 address1'
               ,'03 address2'
               ,'03 city'
               ,'03 state'
               ,'03 zip'
            );

INSERT INTO NAME_ADDRESS
     VALUES (
                '04'
               ,'04 first name'
               ,'04 last_name'
               ,'04 address1'
               ,'04 address2'
               ,'04 city'
               ,'04 state'
               ,'04 zip'
            );

COMMIT;

CREATE TABLE PREV_NAME_ADDRESS
PARALLEL 4
AS
    SELECT NAME_ADDRESS.ID
          ,NAME_ADDRESS.FNAME
          ,NAME_ADDRESS.LNAME
          ,NAME_ADDRESS.ADDRESS1
          ,NAME_ADDRESS.ADDRESS2
          ,NAME_ADDRESS.CITY
          ,NAME_ADDRESS.STATE
          ,NAME_ADDRESS.ZIP
          ,'NONE' AS ACTION
      FROM NAME_ADDRESS;

DELETE FROM NAME_ADDRESS
      WHERE ID = '03';

UPDATE NAME_ADDRESS
   SET CITY = 'canged city'
 WHERE ID = '01';

UPDATE NAME_ADDRESS
   SET ADDRESS1 = 'changed address1'
 WHERE ID = '02';

INSERT INTO NAME_ADDRESS
     VALUES (
                '05'
               ,'05 first name'
               ,'05 last_name'
               ,'05 address1'
               ,'05 address2'
               ,'05 city'
               ,'05 state'
               ,'05 zip'
            );

COMMIT;

SELECT COALESCE(NEW_VALS.ID, OLD_VALS.ID) AS ID
      ,COALESCE(NEW_VALS.FNAME, OLD_VALS.FNAME) AS FNAME
      ,COALESCE(NEW_VALS.LNAME, OLD_VALS.LNAME) AS LNAME
      ,COALESCE(NEW_VALS.ADDRESS1, OLD_VALS.ADDRESS1) AS ADDRESS1
      ,COALESCE(NEW_VALS.ADDRESS2, OLD_VALS.ADDRESS2) AS ADDRESS2
      ,COALESCE(NEW_VALS.CITY, OLD_VALS.CITY) AS CITY
      ,COALESCE(NEW_VALS.STATE, OLD_VALS.STATE) AS STATE
      ,COALESCE(NEW_VALS.ZIP, OLD_VALS.ZIP) AS ZIP
      ,CASE
           WHEN NEW_VALS.ID IS NULL
           THEN
               'D'
           WHEN OLD_VALS.ID IS NULL
           THEN
               'I'
           WHEN    NEW_VALS.ID
--                || NEW_VALS.FNAME
--                || NEW_VALS.LNAME
--                || NEW_VALS.ADDRESS1
--                || NEW_VALS.ADDRESS2
                || NEW_VALS.CITY
                || NEW_VALS.STATE
                || NEW_VALS.ZIP =
                   OLD_VALS.ID
--                || OLD_VALS.FNAME
--                || OLD_VALS.LNAME
--                || OLD_VALS.ADDRESS1
--                || OLD_VALS.ADDRESS2
                || OLD_VALS.CITY
                || OLD_VALS.STATE
                || OLD_VALS.ZIP
           THEN
               'NONE'
           ELSE
               'U'
       END
           AS ACTION
  FROM NAME_ADDRESS NEW_VALS FULL OUTER JOIN PREV_NAME_ADDRESS OLD_VALS ON (NEW_VALS.ID = OLD_VALS.ID);

[Updated on: Thu, 14 September 2017 03:46]

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665659 is a reply to message #665657]

Thu, 14 September 2017 05:33

John Watson
Messages: 8951
Registered: January 2010
Location: Global Village

Senior Member

If all you want to do is capture changes, it might be simpler to write a trigger to do this. I'm no great fan of triggers, but this could be a reasonable usage case.

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665672 is a reply to message #665659]

Thu, 14 September 2017 13:14

Member2014
Messages: 33
Registered: July 2014
Location: USA

Member

BlackSwan - May be i'm not doing a good job explaining here, but what i'm trying to compare two copies of the same table i.e. copy of the name_address table from yesterday (PREV_NAME_ADDRESS) with a copy of the name_address table (NAME_ADDRESS) built today.
I am not obsessing on effeciency but the way I see it thats one of the main concerns I have due to the size of the tables. If not the logic I have listed above will work just fine on a smaller table.

cookiemonster - The business case here is that we have a large table which gets built everyday (drop and create) using a bunch of underlying tables. We need to determine the data for which the address values have changed and generate an extract of those specific records.
This is a rather simple effort with the basic steps outlined above. I am trying to determine the best approach for the volume we will be handling. Hope this makes sense.

John Watson - I don't think triggers can be used here cos this table is a daily drop and create. If it were to be an update that would have been a good solution.

quirks - I had not thought about this approach. I will certainly give it a try.

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665673 is a reply to message #665672]

Thu, 14 September 2017 13:27

BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal

Senior Member

Are the ID values really UNIQUE between the tables?

SELECT COL1, ..., COL18 FROM TODAYS_TABLE TT
MINUS
SELECT COL1, ..., COL18 FROM YESTERDAYS_TABLE YT;

results set from above is changed rows.
If all 18 columns are INDEXED, then the 2 tables themselves will NOT be read.

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665674 is a reply to message #665673]

Thu, 14 September 2017 21:25

Member2014
Messages: 33
Registered: July 2014
Location: USA

Member

Thanks BlackSwan I will try this approach.

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665678 is a reply to message #665674]

Fri, 15 September 2017 02:31

quirks
Messages: 82
Registered: October 2014

Member

@BlackSwan: You might not be able to identify the deleted rows.

@Member2014: I just found this promising approach: http://blog.sqlora.com/en/how-to-simplify-the-data-historization/
Below is an adaption to your case. I'd be glad if you could give me a feed back which of my versions is faster. This one should be (because of skipping the full outer join), but you never know.

WITH
    UNION_OLD_NEW
    AS
        (SELECT 'NEW' SOURCE
               ,NAME_ADDRESS.ID
               ,NAME_ADDRESS.FNAME
               ,NAME_ADDRESS.LNAME
               ,NAME_ADDRESS.ADDRESS1
               ,NAME_ADDRESS.ADDRESS2
               ,NAME_ADDRESS.CITY
               ,NAME_ADDRESS.STATE
               ,NAME_ADDRESS.ZIP
           FROM NAME_ADDRESS
         UNION ALL
         SELECT 'OLD' SOURCE
               ,PREV_NAME_ADDRESS.ID
               ,PREV_NAME_ADDRESS.FNAME
               ,PREV_NAME_ADDRESS.LNAME
               ,PREV_NAME_ADDRESS.ADDRESS1
               ,PREV_NAME_ADDRESS.ADDRESS2
               ,PREV_NAME_ADDRESS.CITY
               ,PREV_NAME_ADDRESS.STATE
               ,PREV_NAME_ADDRESS.ZIP
           FROM PREV_NAME_ADDRESS),
    PREPARE_4_ANALYZE
    AS
        (SELECT UNION_OLD_NEW.*
               ,COUNT(*)
                    OVER(
                        PARTITION BY ID
--                                    ,FNAME
--                                    ,LNAME
--                                    ,ADDRESS1
--                                    ,ADDRESS2
                                    ,CITY
                                    ,STATE
                                    ,ZIP
                    )
                    CNT  --  cnt = 2 - two versions are the same, otherwise INS/UPD/DEL
               ,COUNT(*) OVER (PARTITION BY ID)
                    CNT_KEY  -- the count of versions per ID
           FROM UNION_OLD_NEW),
    EXECUTE_ANALYZE
    AS
        (SELECT PREPARE_4_ANALYZE.*
               ,CASE
                    WHEN (CNT = 1 AND CNT_KEY = 2) THEN 'U'
                    WHEN (CNT = 2 AND CNT_KEY = 2) THEN 'NONE'
                    WHEN (CNT = 1 AND CNT_KEY = 1) THEN 
                         CASE 
                              WHEN SOURCE = 'OLD' THEN 'D' 
                              WHEN SOURCE = 'NEW' THEN 'I' 
                         END
                END
                    AS ACTION
           FROM PREPARE_4_ANALYZE)
SELECT EXECUTE_ANALYZE.ID
      ,EXECUTE_ANALYZE.FNAME
      ,EXECUTE_ANALYZE.LNAME
      ,EXECUTE_ANALYZE.ADDRESS1
      ,EXECUTE_ANALYZE.ADDRESS2
      ,EXECUTE_ANALYZE.CITY
      ,EXECUTE_ANALYZE.STATE
      ,EXECUTE_ANALYZE.ZIP
      ,EXECUTE_ANALYZE.ACTION
  FROM EXECUTE_ANALYZE
 WHERE 
      (SOURCE = 'NEW' AND ACTION IN ('U', 'I', 'NONE')) OR 
      (SOURCE = 'OLD' AND ACTION = 'D') -- filter values

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665704 is a reply to message #665678]

Fri, 15 September 2017 15:41

Member2014
Messages: 33
Registered: July 2014
Location: USA

Member

quirks - This worked perfect.

Another step I added was to create work tables with just the columns I need to compare. The work table creation took about 3 minutes and data comparison for the full volume about 20 mins. Thanks so much for the help. Really appreciate it.

Report message to a moderator

Previous Topic:	Adding ax extra column to the existing constraint
Next Topic:	LPX-00230: invalid character 0 (U+0000) found in a Name or Nmtoken

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

]

Current Time: Sat Sep 28 14:10:35 CDT 2024