Pages

5/11/2011

The cracking of unfamiliar files for data extraction (NetCDF and R)

My boss assigned me some weired data files from a HPLC (High-performance liquid chromatography) machines that we have been working on recently. The built-in software is unable to export data into a common format, say, csv or xls, but it can convert its raw data to some ".cdf" files. We have never heard about such kinds of data before. So I did some research and found some relative information on NASA website. It appeared that the cdf is short for common data format (plain and simple, isn't it?). After trying to install the software provided by NASA on linux, I got an information that the files are actually NetCDF files. It took me no time to locate a powerful package for R for NetCDF I/O. So I post the relevant information here for future reference. The R ncdf package info can be found here.

After running these codes, we can get the structure of the netcdf data file.

library(ncdf)
nc = open.ncdf("Decomposition.013_Ron.013_532011_Admin_013-1.cdf") 
print(nc)

My HPLC data files contain the following information:

[1] "file 48h.001-1.cdf has 11 dimensions:"
[1] "point_number   Size: 4801"
[1] "peak_number   Size: 275"
[1] "_2_byte_string   Size: 2"
[1] "_4_byte_string   Size: 4"
[1] "_8_byte_string   Size: 8"
[1] "_12_byte_string   Size: 12"
[1] "_16_byte_string   Size: 16"
[1] "_32_byte_string   Size: 32"
[1] "_64_byte_string   Size: 64"
[1] "_128_byte_string   Size: 128"
[1] "_255_byte_string   Size: 255"
[1] "------------------------"
[1] "file 48h.001-1.cdf has 22 variables:"
[1] "float ordinate_values[point_number]  Longname:ordinate_values Missval:1e+30"
[1] "float detector_maximum_value[]  Longname:detector_maximum_value Missval:1e+30"
[1] "float detector_minimum_value[]  Longname:detector_minimum_value Missval:1e+30"
[1] "float actual_run_time_length[]  Longname:actual_run_time_length Missval:1e+30"
[1] "float actual_sampling_interval[]  Longname:actual_sampling_interval Missval:1e+30"
[1] "float actual_delay_time[]  Longname:actual_delay_time Missval:1e+30"
[1] "float peak_retention_time[peak_number]  Longname:peak_retention_time Missval:1e+30"
[1] "char peak_name[_32_byte_string,peak_number]  Longname:peak_name Missval:NA"
[1] "float peak_amount[peak_number]  Longname:peak_amount Missval:1e+30"
[1] "float peak_start_time[peak_number]  Longname:peak_start_time Missval:1e+30"
[1] "float peak_end_time[peak_number]  Longname:peak_end_time Missval:1e+30"
[1] "float peak_width[peak_number]  Longname:peak_width Missval:1e+30"
[1] "float peak_area[peak_number]  Longname:peak_area Missval:1e+30"
[1] "float peak_area_percent[peak_number]  Longname:peak_area_percent Missval:1e+30"
[1] "float peak_height[peak_number]  Longname:peak_height Missval:1e+30"
[1] "float peak_height_percent[peak_number]  Longname:peak_height_percent Missval:1e+30"
[1] "float baseline_start_time[peak_number]  Longname:baseline_start_time Missval:1e+30"
[1] "float baseline_start_value[peak_number]  Longname:baseline_start_value Missval:1e+30"
[1] "float baseline_stop_time[peak_number]  Longname:baseline_stop_time Missval:1e+30"
[1] "float baseline_stop_value[peak_number]  Longname:baseline_stop_value Missval:1e+30"
[1] "char peak_start_detection_code[_32_byte_string,peak_number]  Longname:peak_start_detection_code Missval:NA"
[1] "char peak_stop_detection_code[_32_byte_string,peak_number]  Longname:peak_stop_detection_code Missval:NA"


Select the variables ,say, peak_area which is the area between reading and baseline for certain time span. Extract data using the following commend:
peak_area = get.var.ncdf(nc,"peak_area")

The retrieved data are common R vector object which is quite easy to handle.

The source code is attached below:

# TODO: Data extration from NetCDF
# 
# Author: Roger Everett
###############################################################################
library(ncdf)
setwd('Desired Location')
files = dir(pattern = "*.cdf") # I want to convert all the cdf files under my working directory
# create a directory if not exist
if (!file.exists('csv')) dir.create('csv')

for (i in 1:length(files)){
 # open up the ncdf file
 nc = open.ncdf(files[i])#"Decomposition.013_Ron.013_532011_Admin_013-1.cdf"
 
 #point_number = get.var.ncdf(nc,"point_number")
 peak_number = get.var.ncdf(nc,"peak_number")
 #actual_run_time_length = get.var.ncdf(nc,"actual_run_time_length")
 #ordinate_values = get.var.ncdf(nc,"ordinate_values")
 # peak_area correspond to Area
 peak_area = get.var.ncdf(nc,"peak_area")
 peak_height = get.var.ncdf(nc,"peak_height")
 peak_area_percent = get.var.ncdf(nc,"peak_area_percent")
 peak_amount = get.var.ncdf(nc,"peak_amount")
 # peak_retention_time is the R.Time
 peak_retention_time = get.var.ncdf(nc,"peak_retention_time")
 peak_retention_time = peak_retention_time/60 # convert to minute scale
 # area_height_ratio is the A/H in the report
 area_height_ratio = peak_area/peak_height/100
 
 dataset = as.data.frame(cbind(peak_number,peak_area,peak_height,
     peak_area_percent,peak_retention_time,area_height_ratio))
 write.csv(dataset, file = paste('csv/',files[i],'.csv',sep=''), row.names = FALSE,
   fileEncoding = "UTF8")
 
 pdf(paste('csv/',files[i],'.pdf',sep='')) #I want to save a copy of HPLC plot
 print(nc)
 plot(peak_retention_time, peak_height,type='l', xlab='min', main = files[i])
 dev.off()
 close.ncdf(nc)
}

1 comment:

  1. I d like to open cdf file from HPLC-DAD to manipulate with data. Is that possible with R and how, or any other software?

    thanks,
    Perko

    ReplyDelete