Importing data with non-standard blank columns in R

I’m trying to import a data set to work with in R (using the tidyverse). Unfortunately, its a government data set which almost always means some odd standards. In this case, where a certain observation doesn’t have a value for a given variable, ND has been entered as a text string.

I am reluctant to open it up in Excel and manually track down every ND to replace it with a blank cell (even using find and replace) – because it obviously makes my code less replicable. But not doing so means that, when I import the data using read_csv, some of my variable types don’t work properly (e.g. I can’t happily make a column a double). Is there a way, in the data import process, to replace all of those ND entries with "standard" NAs?

I’ve included my code below.

Apologise if this is an obvious question with a simple answer – I’m fairly new to this.

Thanks

 > #Load libraries > library(tidyverse) -- Attaching packages --------------------------------------- tidyverse 1.2.1 -- v ggplot2 3.2.1     v purrr   0.3.3 v tibble  2.1.3     v dplyr   0.8.3 v tidyr   1.0.0     v stringr 1.4.0 v readr   1.3.1     v forcats 0.4.0 -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag()    masks stats::lag()  >  > #Import data >  > #From March 2020, REGION_GEOG and REGION_GEOG_CODE fields removed  > GP.prac.Mar20 <- read_csv( +   "Data/GPWorkforcePracticeLevel/17. General Practice March 2020 Practice level.csv", +   col_types = cols( +     .default = col_double(), +     PRAC_CODE = col_character(), +     PRAC_NAME = col_character(), +     CCG_CODE = col_character(), +     CCG_NAME = col_character(), +     PCN_CODE = col_character(),     +     PCN_NAME = col_character(), +     STP_CODE = col_character(), +     STP_NAME = col_character(), +     REGION_CODE = col_character(), +     REGION_NAME = col_character(), +     HEE_REGION_CODE = col_character(), +     HEE_REGION_NAME = col_character(), +     CONTRACT = col_character(), +     GP_SOURCE = col_character(), +     NURSE_SOURCE = col_character(), +     DPC_SOURCE = col_character(), +     ADMIN_SOURCE = col_character() +   ) + )   |=================================================================| 100%   11 MB Warning: 91380 parsing failures. row                    col expected actual                                                                               file   9 TOTAL_DPC_HC           a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice March 2020 Practice level.csv'   9 TOTAL_DPC_DISPENSER_HC a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice March 2020 Practice level.csv'   9 TOTAL_DPC_HCA_HC       a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice March 2020 Practice level.csv'   9 TOTAL_DPC_PHLEB_HC     a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice March 2020 Practice level.csv'   9 TOTAL_DPC_PHARMA_HC    a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice March 2020 Practice level.csv' ... ...................... ........ ...... .................................................................................. See problems(...) for more details.  >  
Add Comment
1 Answer(s)

try reading the files using data.table's fread() using the proper na.strings = settings

library( data.table )  #without na.strings set data.table::fread(  "col1, col2 ND, test" )  #    col1 col2 # 1:   ND test  #with na.strings set data.table::fread(    "col1, col2   ND, test", na.strings = "ND" )  #    col1 col2 # 1:   NA test 
Answered on August 30, 2020.
Add Comment

Your Answer

By posting your answer, you agree to the privacy policy and terms of service.