I need to select out the records that are "duplicated" in a table. Is
it possible to do the reverse of a Select Distinct. I just cannot
seem to get the SQL right?

Re: Selecting Duplicate records by BlackSabbath

BlackSabbath
Thu Sep 18 06:18:44 CDT 2003

select * from MyTable group by MyIdField having count(*) > 1)


--
Olivier (enlever les 3X pour la réponse)
"Alex" <aharrison@naturalproducts.co.uk> a écrit dans le message de news:
b6d5969a.0309180257.34c09098@posting.google.com...
> I need to select out the records that are "duplicated" in a table. Is
> it possible to do the reverse of a Select Distinct. I just cannot
> seem to get the SQL right?



Re: Selecting Duplicate records by gerry

gerry
Thu Sep 18 06:16:27 CDT 2003

something like this should work :

SELECT t1.key , t2.key ;
FROM table as t1 ;
INNER JOIN table as t2 ;
ON t1.somefield == t2.somefield ;
AND t1.key # t2.key

"Alex" <aharrison@naturalproducts.co.uk> wrote in message
news:b6d5969a.0309180257.34c09098@posting.google.com...
> I need to select out the records that are "duplicated" in a table. Is
> it possible to do the reverse of a Select Distinct. I just cannot
> seem to get the SQL right?



RE: Selecting Duplicate records by Leemi

Leemi
Thu Sep 18 08:42:49 CDT 2003

Hi Alex:

Take a look at this article:

118294 How To Create a List of Duplicate Records
http://support.microsoft.com/?id=118294


I hope this helps.

This posting is provided "AS IS" with no warranties, and confers no rights.

Sincerely,
Microsoft FoxPro Technical Support
Lee Mitchell

*-- VFP8 HAS ARRIVED!! --*
Read about all the new features of VFP8 here:
http://www.universalthread.com/VisualFoxPro/News/VFP8Release.asp
Purchase VFP8 here:
http://shop.microsoft.com/Referral/Productinfo.asp?siteID=11518

Keep an eye on the product lifecycle for Visual FoxPro here:
http://support.microsoft.com/default.aspx?id=fh;[ln];lifeprodv
- VFP5 Mainstream Support retires June 30th, 2003
- VFP6 Mainstream Support retires Sept. 30th, 2003

> I need to select out the records that are "duplicated" in a table. Is
> it possible to do the reverse of a Select Distinct. I just cannot
> seem to get the SQL right?


Re: Selecting Duplicate records by gerry

gerry
Thu Sep 18 11:02:58 CDT 2003

some test i just ran on different ways to do this.

2 things surprised me :
1) the fastest process was via SCAN
2 ) LEFT JOIN is much faster than INNER JOIN


************************************************
* locate duplicate items within table
* 540,000 records
* 2 sets of dups exist each with 2 records each
* indexes :
* xfile_no
* tariff
* xfile_no+tariff tag xxx
************************************************
ON ERROR

_screen.Activate()

SET TALK ON noWINDOW

SELECT srcData
SET ORDER TO

SYS(3054,12)

*!* *!* *** >>> 112 seconds
*!* ?"JOIN "+TRANSFORM(SECONDS())
*!* SELECT s1.xfile_id , s1.tariff ;
*!* FROM srcData s1 ;
*!* INNER JOIN srcData s2 ;
*!* ON s1.xfile_id+s1.tariff == s2.xfile_id+s2.tariff ;
*!* AND s1.row_id # s2.row_id ;
*!* INTO CURSOR xxx

*!* *** >>> 40 seconds
*!* ?"JOIN "+TRANSFORM(SECONDS())
*!* SELECT s2.xfile_id , s2.tariff ;
*!* FROM srcData s1 ;
*!* LEFT JOIN srcData s2 ;
*!* ON s1.xfile_id+s1.tariff == s2.xfile_id+s2.tariff ;
*!* AND s1.row_id # s2.row_id ;
*!* INTO CURSOR xxx
*!* SELECT * from xxx WHERE !ISNULL(xfile_id) ;
*!* into cursor xxx

*!* *** >>> 48 seconds
*!* ?"JOIN "+TRANSFORM(SECONDS())
*!* SELECT s2.xfile_id , s2.tariff ;
*!* FROM srcData s1 ;
*!* LEFT JOIN srcData s2 ;
*!* ON s1.xfile_id == s2.xfile_id ;
*!* AND s1.tariff == s2.tariff ;
*!* AND s1.row_id # s2.row_id ;
*!* INTO CURSOR xxx
*!* SELECT * from xxx WHERE !ISNULL(xfile_id) ;
*!* into cursor xxx
*!*


*!* *** >>> 560 seconds
*!* ?"GROUP "+TRANSFORM(SECONDS())
*!* select * from srcData ;
*!* group by xfile_id , tariff having count(*) > 1 ;
*!* INTO CURSOR xxx

*!* *** >>> 36 seconds
*!* ?"GROUP "+TRANSFORM(SECONDS())
*!* select xfile_id , tariff from srcData ;
*!* group by xfile_id , tariff having count(*) > 1 ;
*!* INTO CURSOR xxx


*!* *** >>> 21 seconds
*!* ?"ORDER "+TRANSFORM(SECONDS())
*!* SELECT srcData
*!* SET ORDER to tag xxx

*!* LOCAL tariff
*!* m.tariff="!!!"
*!* ?"SCAN "+TRANSFORM(SECONDS())
*!* SCAN
*!* IF m.tariff==srcData.tariff THEN
*!* ? srcData.xfile_id+" "+srcData.tariff
*!* ENDIF
*!* m.tariff=srcData.tariff
*!* ENDSCAN
*!* *!* USE IN xxx

?"DONE "+TRANSFORM(SECONDS())

WAIT WINDOW "xxx"



Re: Selecting Duplicate records by gerry

gerry
Thu Sep 18 14:34:41 CDT 2003

Hi ,

I ran the sequence of tests multiple times and restarted VFP between each
individual test.
I also ran each test multiple times in a row.
The results were all pretty consistant with the times stated.



"Wolfgang Schmale" <w_schmale@eplus-online.de> wrote in message
news:eYvx9OhfDHA.3200@tk2msftngp13.phx.gbl...
> Hi Gerry!
>
> Did you restart VFP or better the whole machine after every test? If not
the
> caching of the data by VFP or Windows has an influence on your results
>
> --
> _________________
>
> MFG
> Wolfgang Schmale
>
> MS Visual FoxPro MVP
>
> --------------------------------
> "gerry" <germ@hotmail.com> schrieb im Newsbeitrag
> news:uZmMp8ffDHA.132@tk2msftngp13.phx.gbl...
> > some test i just ran on different ways to do this.
> >
> > 2 things surprised me :
> > 1) the fastest process was via SCAN
> > 2 ) LEFT JOIN is much faster than INNER JOIN
> >
> >
> > ************************************************
> > * locate duplicate items within table
> > * 540,000 records
> > * 2 sets of dups exist each with 2 records each
> > * indexes :
> > * xfile_no
> > * tariff
> > * xfile_no+tariff tag xxx
> > ************************************************
> > ON ERROR
> >
> > _screen.Activate()
> >
> > SET TALK ON noWINDOW
> >
> > SELECT srcData
> > SET ORDER TO
> >
> > SYS(3054,12)
> >
> > *!* *!* *** >>> 112 seconds
> > *!* ?"JOIN "+TRANSFORM(SECONDS())
> > *!* SELECT s1.xfile_id , s1.tariff ;
> > *!* FROM srcData s1 ;
> > *!* INNER JOIN srcData s2 ;
> > *!* ON s1.xfile_id+s1.tariff == s2.xfile_id+s2.tariff ;
> > *!* AND s1.row_id # s2.row_id ;
> > *!* INTO CURSOR xxx
> >
> > *!* *** >>> 40 seconds
> > *!* ?"JOIN "+TRANSFORM(SECONDS())
> > *!* SELECT s2.xfile_id , s2.tariff ;
> > *!* FROM srcData s1 ;
> > *!* LEFT JOIN srcData s2 ;
> > *!* ON s1.xfile_id+s1.tariff == s2.xfile_id+s2.tariff ;
> > *!* AND s1.row_id # s2.row_id ;
> > *!* INTO CURSOR xxx
> > *!* SELECT * from xxx WHERE !ISNULL(xfile_id) ;
> > *!* into cursor xxx
> >
> > *!* *** >>> 48 seconds
> > *!* ?"JOIN "+TRANSFORM(SECONDS())
> > *!* SELECT s2.xfile_id , s2.tariff ;
> > *!* FROM srcData s1 ;
> > *!* LEFT JOIN srcData s2 ;
> > *!* ON s1.xfile_id == s2.xfile_id ;
> > *!* AND s1.tariff == s2.tariff ;
> > *!* AND s1.row_id # s2.row_id ;
> > *!* INTO CURSOR xxx
> > *!* SELECT * from xxx WHERE !ISNULL(xfile_id) ;
> > *!* into cursor xxx
> > *!*
> >
> >
> > *!* *** >>> 560 seconds
> > *!* ?"GROUP "+TRANSFORM(SECONDS())
> > *!* select * from srcData ;
> > *!* group by xfile_id , tariff having count(*) > 1 ;
> > *!* INTO CURSOR xxx
> >
> > *!* *** >>> 36 seconds
> > *!* ?"GROUP "+TRANSFORM(SECONDS())
> > *!* select xfile_id , tariff from srcData ;
> > *!* group by xfile_id , tariff having count(*) > 1 ;
> > *!* INTO CURSOR xxx
> >
> >
> > *!* *** >>> 21 seconds
> > *!* ?"ORDER "+TRANSFORM(SECONDS())
> > *!* SELECT srcData
> > *!* SET ORDER to tag xxx
> >
> > *!* LOCAL tariff
> > *!* m.tariff="!!!"
> > *!* ?"SCAN "+TRANSFORM(SECONDS())
> > *!* SCAN
> > *!* IF m.tariff==srcData.tariff THEN
> > *!* ? srcData.xfile_id+" "+srcData.tariff
> > *!* ENDIF
> > *!* m.tariff=srcData.tariff
> > *!* ENDSCAN
> > *!* *!* USE IN xxx
> >
> > ?"DONE "+TRANSFORM(SECONDS())
> >
> > WAIT WINDOW "xxx"
> >
> >
>



Re: Selecting Duplicate records by gerry

gerry
Thu Sep 18 17:23:46 CDT 2003

i don't know if anyone has any interest in this stuff but another approach I
tried was to process the file in pieces :

USE srcData!SrcData IN 0 AGAIN ALIAS ids
SELECT ids
SET order to tag xfiles && Unique index on xfiles_id -> total 20
unique values
SCAN
SELECT s1.xfile_id , s1.tariff ;
FROM srcData s1 ;
INNER JOIN srcData s2 ;
ON s1.xfile_id+s1.tariff == s2.xfile_id+s2.tariff ;
AND s1.row_id # s2.row_id ;
WHERE s1.xfile_id == ids.xfile_id ;
INTO CURSOR xxx
ENDSCAN


This scheme cut the time for the self JOIN from 112 seconds to 31 seconds .
It had no effect on the GROUP
It is not applicable to the SCAN.

So , in the end , the SCAN is still winner , and after some thought this
does make some sense as the SCAN only requires a single read of each record
whereas the JOIN method involves reading each record at least 2 times - the
GROUP by i'm not sure about record accesses but in any case there would
obviously be a fair amount of extra processing involved.



"gerry" <germ@hotmail.com> wrote in message
news:#3Xi8yhfDHA.2352@TK2MSFTNGP12.phx.gbl...
> Hi ,
>
> I ran the sequence of tests multiple times and restarted VFP between each
> individual test.
> I also ran each test multiple times in a row.
> The results were all pretty consistant with the times stated.
>
>
>
> "Wolfgang Schmale" <w_schmale@eplus-online.de> wrote in message
> news:eYvx9OhfDHA.3200@tk2msftngp13.phx.gbl...
> > Hi Gerry!
> >
> > Did you restart VFP or better the whole machine after every test? If not
> the
> > caching of the data by VFP or Windows has an influence on your results
> >
> > --
> > _________________
> >
> > MFG
> > Wolfgang Schmale
> >
> > MS Visual FoxPro MVP
> >
> > --------------------------------
> > "gerry" <germ@hotmail.com> schrieb im Newsbeitrag
> > news:uZmMp8ffDHA.132@tk2msftngp13.phx.gbl...
> > > some test i just ran on different ways to do this.
> > >
> > > 2 things surprised me :
> > > 1) the fastest process was via SCAN
> > > 2 ) LEFT JOIN is much faster than INNER JOIN
> > >
> > >
> > > ************************************************
> > > * locate duplicate items within table
> > > * 540,000 records
> > > * 2 sets of dups exist each with 2 records each
> > > * indexes :
> > > * xfile_no
> > > * tariff
> > > * xfile_no+tariff tag xxx
> > > ************************************************
> > > ON ERROR
> > >
> > > _screen.Activate()
> > >
> > > SET TALK ON noWINDOW
> > >
> > > SELECT srcData
> > > SET ORDER TO
> > >
> > > SYS(3054,12)
> > >
> > > *!* *!* *** >>> 112 seconds
> > > *!* ?"JOIN "+TRANSFORM(SECONDS())
> > > *!* SELECT s1.xfile_id , s1.tariff ;
> > > *!* FROM srcData s1 ;
> > > *!* INNER JOIN srcData s2 ;
> > > *!* ON s1.xfile_id+s1.tariff == s2.xfile_id+s2.tariff ;
> > > *!* AND s1.row_id # s2.row_id ;
> > > *!* INTO CURSOR xxx
> > >
> > > *!* *** >>> 40 seconds
> > > *!* ?"JOIN "+TRANSFORM(SECONDS())
> > > *!* SELECT s2.xfile_id , s2.tariff ;
> > > *!* FROM srcData s1 ;
> > > *!* LEFT JOIN srcData s2 ;
> > > *!* ON s1.xfile_id+s1.tariff == s2.xfile_id+s2.tariff ;
> > > *!* AND s1.row_id # s2.row_id ;
> > > *!* INTO CURSOR xxx
> > > *!* SELECT * from xxx WHERE !ISNULL(xfile_id) ;
> > > *!* into cursor xxx
> > >
> > > *!* *** >>> 48 seconds
> > > *!* ?"JOIN "+TRANSFORM(SECONDS())
> > > *!* SELECT s2.xfile_id , s2.tariff ;
> > > *!* FROM srcData s1 ;
> > > *!* LEFT JOIN srcData s2 ;
> > > *!* ON s1.xfile_id == s2.xfile_id ;
> > > *!* AND s1.tariff == s2.tariff ;
> > > *!* AND s1.row_id # s2.row_id ;
> > > *!* INTO CURSOR xxx
> > > *!* SELECT * from xxx WHERE !ISNULL(xfile_id) ;
> > > *!* into cursor xxx
> > > *!*
> > >
> > >
> > > *!* *** >>> 560 seconds
> > > *!* ?"GROUP "+TRANSFORM(SECONDS())
> > > *!* select * from srcData ;
> > > *!* group by xfile_id , tariff having count(*) > 1 ;
> > > *!* INTO CURSOR xxx
> > >
> > > *!* *** >>> 36 seconds
> > > *!* ?"GROUP "+TRANSFORM(SECONDS())
> > > *!* select xfile_id , tariff from srcData ;
> > > *!* group by xfile_id , tariff having count(*) > 1 ;
> > > *!* INTO CURSOR xxx
> > >
> > >
> > > *!* *** >>> 21 seconds
> > > *!* ?"ORDER "+TRANSFORM(SECONDS())
> > > *!* SELECT srcData
> > > *!* SET ORDER to tag xxx
> > >
> > > *!* LOCAL tariff
> > > *!* m.tariff="!!!"
> > > *!* ?"SCAN "+TRANSFORM(SECONDS())
> > > *!* SCAN
> > > *!* IF m.tariff==srcData.tariff THEN
> > > *!* ? srcData.xfile_id+" "+srcData.tariff
> > > *!* ENDIF
> > > *!* m.tariff=srcData.tariff
> > > *!* ENDSCAN
> > > *!* *!* USE IN xxx
> > >
> > > ?"DONE "+TRANSFORM(SECONDS())
> > >
> > > WAIT WINDOW "xxx"
> > >
> > >
> >
>
>