Home » SQL & PL/SQL » SQL & PL/SQL » Large volume data compare - NEED HELP (Oracle 11)
Large volume data compare - NEED HELP [message #665643] |
Wed, 13 September 2017 22:06 |
|
Member2014
Messages: 33 Registered: July 2014 Location: USA
|
Member |
|
|
Hello,
I need help with a solution we are trying to build. We have a process which drops and recreates 2 large tables (each contains about 200MM records and about 800 columns) daily.
I need to extract records from each of the tables daily after they are built if a subset of column values for the primary key has changed. The two issues I have is:
1. Since its a drop and recreate, we currently do not have daily backups of the table to compare
2. its a very large volume and so not sure the best way to compare data
Below is a sample of one of the tables.
table name:
create table name_address
(id varchar2(100),
fname varchar2(100),
lname varchar2(200),
address1 varchar2(500),
address2 varchar2(500),
city varchar2(50),
state varchar2(50),
zip varchar2(50));
Extract layout is as below. We need to pull records in the file for any ID's when there is a change in city or state or zip.
The column action indicates I - if its a new record U - if any of the values indicated above have been updated
id, fname, lname, city, state, zip, action
I will need to do the following -
1. After the extract for a given day is completed, create a table daily with the subset of columns needed for comparison say prev_name_address
2. Next day once the table has been built compare the two tables prev_name_address and name_address against specific columns (city/ state/ zip) to determine the new/ updated records
My problem areas -
1. What is the best way to back up a subset of data for a table which is so large (~200MM records with ~700 columns)
2. What is the best way to compare the backup vs production table to identify the inserts/updates based on the city/ state/ zip value changes
Any help is greatly appreciated. Thank you in advance.
|
|
|
Large volume data compare - NEED HELP [message #665644 is a reply to message #665643] |
Wed, 13 September 2017 22:06 |
|
Member2014
Messages: 33 Registered: July 2014 Location: USA
|
Member |
|
|
Hello,
I need help with a solution we are trying to build. We have a process which drops and recreates 2 large tables (each contains about 200MM records and about 800 columns) daily.
I need to extract records from each of the tables daily after they are built if a subset of column values for the primary key has changed. The two issues I have is:
1. Since its a drop and recreate, we currently do not have daily backups of the table to compare
2. its a very large volume and so not sure the best way to compare data
Below is a sample of one of the tables.
table name:
create table name_address
(id varchar2(100),
fname varchar2(100),
lname varchar2(200),
address1 varchar2(500),
address2 varchar2(500),
city varchar2(50),
state varchar2(50),
zip varchar2(50));
Extract layout is as below. We need to pull records in the file for any ID's when there is a change in city or state or zip.
The column action indicates I - if its a new record U - if any of the values indicated above have been updated
id, fname, lname, city, state, zip, action
I will need to do the following -
1. After the extract for a given day is completed, create a table daily with the subset of columns needed for comparison say prev_name_address
2. Next day once the table has been built compare the two tables prev_name_address and name_address against specific columns (city/ state/ zip) to determine the new/ updated records
My problem areas -
1. What is the best way to back up a subset of data for a table which is so large (~200MM records with ~700 columns)
2. What is the best way to compare the backup vs production table to identify the inserts/updates based on the city/ state/ zip value changes
Any help is greatly appreciated. Thank you in advance.
|
|
|
|
Re: Large volume data compare - NEED HELP [message #665647 is a reply to message #665646] |
Wed, 13 September 2017 22:56 |
|
Member2014
Messages: 33 Registered: July 2014 Location: USA
|
Member |
|
|
Apologize if its a little unclear. I'm still trying to figure out how to implement but hope this helps -
1) in integer value what does 200MM equal?
200,000,000 i.e. 200 million
2) on "drop & recreate" what is the data source for the data values. (Why not TRUNCATE instead?)
We have multiple data sources used to build this table and a series of logic involved. The process i inherited had been built to drop an create. I'm not sure what the though process was when this was built. Definately an optimization that can be looked into
3) You say that 2 large tables are involved, but I am unclear what the difference is between the 2 tables. (Why 2 tables?)
One table contains name/address data and the other email. For the purpose of this example I have given a sample of one of the tables. Once I have the logic for one of the tables it should be a simple replication for the other
4) AFAIK SQL does not have any "extract" statement; so what exactly needs to be done.
5) I am unclear what is being compared to what?
I have put in code for a simple way to generate the extract. This will generate the result but probably not the most optimized way.
1. Back up table with necessary columns to compare (day 1)
CREATE TABLE PREV_NAME_ADDRESS PARALLEL 4
AS SELECT id, fname, lname, city, state, zip FROM NAME_ADDRESS;
2. Once the NAME_ADDRESS table is built (day 2) compare the table with PREV_NAME_ADDRESS on a subset of columns i.e. city/ state/ zip
CREATE TABLE NAME_ADDRESS_EXTRACT PARALLEL 4
AS SELECT id, fname, lname, city, state, zip,
CASE WHEN n.id IS NOT NULL AND o.id IS NULL THEN 'I' ELSE 'U' END IF AS action
FROM NAME_ADDRESS n
LEFT OUTER JOIN PREV_NAME_ADDRESS o
ON n.id = o.id
WHERE n.city <> o.city
AND n.state <> o.state
AND n.zip <> o.zip;
3. Generate an extract of the NAME_ADDRESS_EXTRACT table once its built to a flat file
SELECT * FROM NAME_ADDRESS_EXTRACT;
The logic for populating the "action" column I need to test.
Please note that the name_address table I have built here is just a subset of the table we have on the DB. My biggest concern is the comparison of the columns between name_address and prev_name_address.
Here I have shown 3 columns to be compared out of 8 columns. In my DB table I need to compare 18 columns out of about 700 columns.
Is there a better way like adding indices for faster comparison, hash function, or something else which will make the comparison more efficient and faster?
Thank you again for looking into it.
|
|
|
|
Re: Large volume data compare - NEED HELP [message #665656 is a reply to message #665648] |
Thu, 14 September 2017 03:37 |
cookiemonster
Messages: 13943 Registered: September 2008 Location: Rainy Manchester
|
Senior Member |
|
|
It'd probably help if you described the business process you're trying to implement, rather than what this particular implementation is trying to do.
There may well be easier/ more efficient ways than creating tables on the fly every day.
|
|
|
Re: Large volume data compare - NEED HELP [message #665657 is a reply to message #665648] |
Thu, 14 September 2017 03:44 |
|
quirks
Messages: 82 Registered: October 2014
|
Member |
|
|
CREATE TABLE NAME_ADDRESS(
ID VARCHAR2(100)
,FNAME VARCHAR2(100)
,LNAME VARCHAR2(200)
,ADDRESS1 VARCHAR2(500)
,ADDRESS2 VARCHAR2(500)
,CITY VARCHAR2(50)
,STATE VARCHAR2(50)
,ZIP VARCHAR2(50)
);
INSERT INTO NAME_ADDRESS
VALUES (
'01'
,'01 first name'
,'01 last_name'
,'01 address1'
,'01 address2'
,'01 city'
,'01 state'
,'01 zip'
);
INSERT INTO NAME_ADDRESS
VALUES (
'02'
,'02 first name'
,'02 last_name'
,'02 address1'
,'02 address2'
,'02 city'
,'02 state'
,'02 zip'
);
INSERT INTO NAME_ADDRESS
VALUES (
'03'
,'03 first name'
,'03 last_name'
,'03 address1'
,'03 address2'
,'03 city'
,'03 state'
,'03 zip'
);
INSERT INTO NAME_ADDRESS
VALUES (
'04'
,'04 first name'
,'04 last_name'
,'04 address1'
,'04 address2'
,'04 city'
,'04 state'
,'04 zip'
);
COMMIT;
CREATE TABLE PREV_NAME_ADDRESS
PARALLEL 4
AS
SELECT NAME_ADDRESS.ID
,NAME_ADDRESS.FNAME
,NAME_ADDRESS.LNAME
,NAME_ADDRESS.ADDRESS1
,NAME_ADDRESS.ADDRESS2
,NAME_ADDRESS.CITY
,NAME_ADDRESS.STATE
,NAME_ADDRESS.ZIP
,'NONE' AS ACTION
FROM NAME_ADDRESS;
DELETE FROM NAME_ADDRESS
WHERE ID = '03';
UPDATE NAME_ADDRESS
SET CITY = 'canged city'
WHERE ID = '01';
UPDATE NAME_ADDRESS
SET ADDRESS1 = 'changed address1'
WHERE ID = '02';
INSERT INTO NAME_ADDRESS
VALUES (
'05'
,'05 first name'
,'05 last_name'
,'05 address1'
,'05 address2'
,'05 city'
,'05 state'
,'05 zip'
);
COMMIT;
SELECT COALESCE(NEW_VALS.ID, OLD_VALS.ID) AS ID
,COALESCE(NEW_VALS.FNAME, OLD_VALS.FNAME) AS FNAME
,COALESCE(NEW_VALS.LNAME, OLD_VALS.LNAME) AS LNAME
,COALESCE(NEW_VALS.ADDRESS1, OLD_VALS.ADDRESS1) AS ADDRESS1
,COALESCE(NEW_VALS.ADDRESS2, OLD_VALS.ADDRESS2) AS ADDRESS2
,COALESCE(NEW_VALS.CITY, OLD_VALS.CITY) AS CITY
,COALESCE(NEW_VALS.STATE, OLD_VALS.STATE) AS STATE
,COALESCE(NEW_VALS.ZIP, OLD_VALS.ZIP) AS ZIP
,CASE
WHEN NEW_VALS.ID IS NULL
THEN
'D'
WHEN OLD_VALS.ID IS NULL
THEN
'I'
WHEN NEW_VALS.ID
-- || NEW_VALS.FNAME
-- || NEW_VALS.LNAME
-- || NEW_VALS.ADDRESS1
-- || NEW_VALS.ADDRESS2
|| NEW_VALS.CITY
|| NEW_VALS.STATE
|| NEW_VALS.ZIP =
OLD_VALS.ID
-- || OLD_VALS.FNAME
-- || OLD_VALS.LNAME
-- || OLD_VALS.ADDRESS1
-- || OLD_VALS.ADDRESS2
|| OLD_VALS.CITY
|| OLD_VALS.STATE
|| OLD_VALS.ZIP
THEN
'NONE'
ELSE
'U'
END
AS ACTION
FROM NAME_ADDRESS NEW_VALS FULL OUTER JOIN PREV_NAME_ADDRESS OLD_VALS ON (NEW_VALS.ID = OLD_VALS.ID);
[Updated on: Thu, 14 September 2017 03:46] Report message to a moderator
|
|
|
|
|
|
|
Re: Large volume data compare - NEED HELP [message #665678 is a reply to message #665674] |
Fri, 15 September 2017 02:31 |
|
quirks
Messages: 82 Registered: October 2014
|
Member |
|
|
@BlackSwan: You might not be able to identify the deleted rows.
@Member2014: I just found this promising approach: http://blog.sqlora.com/en/how-to-simplify-the-data-historization/
Below is an adaption to your case. I'd be glad if you could give me a feed back which of my versions is faster. This one should be (because of skipping the full outer join), but you never know.
WITH
UNION_OLD_NEW
AS
(SELECT 'NEW' SOURCE
,NAME_ADDRESS.ID
,NAME_ADDRESS.FNAME
,NAME_ADDRESS.LNAME
,NAME_ADDRESS.ADDRESS1
,NAME_ADDRESS.ADDRESS2
,NAME_ADDRESS.CITY
,NAME_ADDRESS.STATE
,NAME_ADDRESS.ZIP
FROM NAME_ADDRESS
UNION ALL
SELECT 'OLD' SOURCE
,PREV_NAME_ADDRESS.ID
,PREV_NAME_ADDRESS.FNAME
,PREV_NAME_ADDRESS.LNAME
,PREV_NAME_ADDRESS.ADDRESS1
,PREV_NAME_ADDRESS.ADDRESS2
,PREV_NAME_ADDRESS.CITY
,PREV_NAME_ADDRESS.STATE
,PREV_NAME_ADDRESS.ZIP
FROM PREV_NAME_ADDRESS),
PREPARE_4_ANALYZE
AS
(SELECT UNION_OLD_NEW.*
,COUNT(*)
OVER(
PARTITION BY ID
-- ,FNAME
-- ,LNAME
-- ,ADDRESS1
-- ,ADDRESS2
,CITY
,STATE
,ZIP
)
CNT -- cnt = 2 - two versions are the same, otherwise INS/UPD/DEL
,COUNT(*) OVER (PARTITION BY ID)
CNT_KEY -- the count of versions per ID
FROM UNION_OLD_NEW),
EXECUTE_ANALYZE
AS
(SELECT PREPARE_4_ANALYZE.*
,CASE
WHEN (CNT = 1 AND CNT_KEY = 2) THEN 'U'
WHEN (CNT = 2 AND CNT_KEY = 2) THEN 'NONE'
WHEN (CNT = 1 AND CNT_KEY = 1) THEN
CASE
WHEN SOURCE = 'OLD' THEN 'D'
WHEN SOURCE = 'NEW' THEN 'I'
END
END
AS ACTION
FROM PREPARE_4_ANALYZE)
SELECT EXECUTE_ANALYZE.ID
,EXECUTE_ANALYZE.FNAME
,EXECUTE_ANALYZE.LNAME
,EXECUTE_ANALYZE.ADDRESS1
,EXECUTE_ANALYZE.ADDRESS2
,EXECUTE_ANALYZE.CITY
,EXECUTE_ANALYZE.STATE
,EXECUTE_ANALYZE.ZIP
,EXECUTE_ANALYZE.ACTION
FROM EXECUTE_ANALYZE
WHERE
(SOURCE = 'NEW' AND ACTION IN ('U', 'I', 'NONE')) OR
(SOURCE = 'OLD' AND ACTION = 'D') -- filter values
|
|
|
Re: Large volume data compare - NEED HELP [message #665704 is a reply to message #665678] |
Fri, 15 September 2017 15:41 |
|
Member2014
Messages: 33 Registered: July 2014 Location: USA
|
Member |
|
|
quirks - This worked perfect.
Another step I added was to create work tables with just the columns I need to compare. The work table creation took about 3 minutes and data comparison for the full volume about 20 mins. Thanks so much for the help. Really appreciate it.
|
|
|
Goto Forum:
Current Time: Sat Sep 28 14:10:35 CDT 2024
|