Duplicated Data (pandas)
- In data analysis, duplicated data is handled in two steps
- 1st : Check whether the data contains duplicated rows
- The standard for "duplicated" can vary (e.g. two rows have the same values in all columns, or the same values in just 2 specific columns)
- 2nd : Remove the rows that are duplicated according to the chosen standard
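The two steps can be sketched roughly as follows. This is only an illustration: the sample DataFrame and its column names (name, city, score) are made up, and which standard to use depends on the data at hand.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "city": ["Seoul", "Busan", "Seoul", "Seoul"],
    "score": [90, 85, 90, 70],
})

# 1st: check for duplicates under two different standards.
# Standard A: values from all columns are the same.
n_full = int(df.duplicated().sum())
# Standard B: values from just 2 columns (name, city) are the same.
n_partial = int(df.duplicated(subset=["name", "city"]).sum())

print(n_full, n_partial)

# 2nd: remove the rows that are duplicated under the chosen standard (B here).
deduped = df.drop_duplicates(subset=["name", "city"])
print(deduped)
```

The same data gives a different duplicate count depending on the standard, so it helps to decide the standard before removing anything.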
Check Duplicated Data
- df.duplicated()
- Returns a boolean Series (one value per row)
- If a row is a duplicate of any earlier row (not only the row directly before it), its value in the Series is True; the first occurrence stays False by default
- df['column_name'].duplicated() or df[['c1', 'c2', ...]].duplicated()
- Returns a boolean Series
- Only the selected columns are used as the standard for deciding whether a row is duplicated
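A minimal sketch of the checks above, assuming a small made-up DataFrame with columns c1, c2, c3. The subset= and keep= arguments are not mentioned in the notes above; they are standard parameters of DataFrame.duplicated and are shown here only as an alternative way to express the same check.

```python
import pandas as pd

# Made-up data: row 2 repeats row 0 in every column,
# row 3 repeats row 0 only in c1 and c2.
df = pd.DataFrame({
    "c1": [1, 2, 1, 1],
    "c2": ["a", "b", "a", "a"],
    "c3": [10, 20, 10, 30],
})

# Standard: all columns must match an earlier row.
mask_all = df.duplicated()

# Standard: only c1 and c2 must match an earlier row.
mask_sub = df[["c1", "c2"]].duplicated()

# Same standard, written with the subset= parameter on the full DataFrame.
mask_sub2 = df.duplicated(subset=["c1", "c2"])

# keep=False marks every member of a duplicated group, including the first.
mask_every = df.duplicated(keep=False)

# The boolean Series can be used to look at the duplicated rows themselves.
dup_rows = df[mask_sub]

print(mask_all.tolist())
print(mask_sub.tolist())
print(dup_rows)
```

Because the result is a boolean Series, calling .sum() on it gives the number of duplicated rows and .any() tells whether there are any at all.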
Remove Duplicated Data
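- df = df.drop_duplicates() (inplace=True can be used instead; it modifies df directly and returns None, so do not assign its result back to df)
- Returns a DataFrame with the duplicated rows removed (the first occurrence of each row is kept by default)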
- df = df.drop_duplicates(subset=['c1', 'c2', ...])
- Returns a DataFrame with rows removed when they are duplicated according to the selected columns only
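The removal step might look like the sketch below, again on a made-up DataFrame with columns c1, c2, c3. The keep="last" option and the reset_index call are additions beyond the notes above, included only to show common follow-up choices.

```python
import pandas as pd

# Made-up data: row 2 repeats row 0 in every column,
# row 3 repeats row 0 only in c1 and c2.
df = pd.DataFrame({
    "c1": [1, 2, 1, 1],
    "c2": ["a", "b", "a", "a"],
    "c3": [10, 20, 10, 30],
})

# Remove rows that match an earlier row in all columns.
no_full_dups = df.drop_duplicates()

# Remove rows that match an earlier row in c1 and c2 only.
no_sub_dups = df.drop_duplicates(subset=["c1", "c2"])

# keep="last" keeps the last occurrence of each group instead of the first.
keep_last = df.drop_duplicates(subset=["c1", "c2"], keep="last")

# inplace=True changes the DataFrame itself and returns None,
# so the return value should not be assigned back.
df_copy = df.copy()
result = df_copy.drop_duplicates(inplace=True)

# The original index labels survive the drop; reset them if a clean 0..n-1 index is wanted.
tidy = no_sub_dups.reset_index(drop=True)

print(no_full_dups)
print(no_sub_dups)
print(keep_last)
```

Which occurrence to keep matters when the non-subset columns differ (c3 here), since the dropped rows take their other values with them.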