QC of the Pacrain Daily Rainfall Database  -Nicolas Mirsky
updated 5/15/2000

  To give the user community more information about the quality of the Comprehensive Pacific Rainfall Database, we have undertaken a series of tests to examine the data in more detail.  The first test analyzed the percentage of zero/trace rainfall and the average daily rainfall for each day of the week (Mon., Tues., etc.).  The purpose of this test was to uncover the human element in the recording of data: for example, an observer returning on Monday may report erroneous values for the weekend.  This examination showed that one of the most dominant error patterns was failing to flag a rainfall amount as an accumulated value while incorrectly recording zeros for the days preceding the accumulation.  A clear example of this error pattern is shown below:
 
Station  # Sun. w/data  # Mon. w/data  # Tue. w/data # Wed. w/data # Thu. w/data # Fri. w/data # Sat. w/data
14000 759  831 828 826 811 820 754

   
     
Station  % Sun. zeros % Mon. zeros % Tue. zeros % Wed. zeros % Thu. zeros % Fri. zeros % Sat. zeros
14000 0.72 0.26 0.41 0.38 0.45 0.43 0.72

 
Station  Sun. mean rain  Mon. mean rain  Tue. mean rain  Wed. mean rain  Thu. mean rain  Fri. mean rain  Sat. mean rain
14000 78.4  308.9 170.6 159.6 133.5 138.2 80.2

 
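In this example, the fraction of zeros is much higher on Saturdays and Sundays (0.72) than on the other days (0.26 to 0.45), and the Monday mean is roughly twice the other weekday means, which suggests that weekend rainfall was recorded on Monday without being flagged as accumulated.  The pattern above can also occur over a relatively short period of time, making it harder to detect.  This was the main motivation behind the next QC test, which examined daily rainfall values having a cumulative gamma probability greater than or equal to 0.98 AND having at least 5 preceding zeros/trace/missing (gamma distributions were fitted separately for each station).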
Comparing rainfall records among neighboring stations was another QC test that was performed.  The comparisons were done over 5-year periods for those atolls, raised atolls, and low island stations that recorded a sufficient amount of data for the period.  Initial results, while perhaps showing some significant differences in daily rainfall statistics among some of the neighboring stations, were inconclusive.
 The last QC test used a tropical storm database in conjunction with a tropical storm model (see descriptions below) to get estimates of how much rain one might reasonably expect on tropical storm days.  Since the exact accuracy of the tropical storm model is unknown, one of the most useful results of this analysis was to list instances when a station recorded zero rainfall on a significant tropical storm day.
 As a summary of how well each station performed on the QC tests as a whole, a final score was calculated for each station.  Only stations which had a sufficient amount of data were scored.  Since the scores are somewhat subjective, one should view them with caution and not assume a station’s degree of reliability based on its score.  Also, stations with high quality data for only a fraction of their overall record would receive a low to medium score.  In summary, these techniques have been applied to examine the accuracy of the data and to identify potential problems.  The use of the data is, of course, ultimately left up to the individual user.  We look forward to hearing comments from the user community about the efficacy of these results.  The files produced by the different tests, and a description of the methods and results, are provided below.

Description of QC files and methodologies

QC files (in Microsoft Excel, total space 4.76Mb):

Download ALL files in one master zip file.

1)  histog.xls
2)  station_stats.xls
3)  large_amts98.xls
4)  large_amts98_5.xls
5)  large_amts10.xls
6)  large_amts_accum.xls
7)  stats71to75.xls
8)  stats76to80.xls
9)  stats81to85.xls
10)  stats86to90.xls
11)  stats91to97.xls
12)  dow_table.xls
13)  hurr_NH_all.xls
14)  hurr_SH_all.xls
15)  hurr_susp_zeros.xls
16)  final_scores.xls

1) histog.xls   download

Histogram of daily rainfall from ALL the stations in the Pacrain database from the period 1971 to 1997.  Amounts are in 10ths of mm.

2)  station_stats.xls    download

(Note: missing values in station_stats.xls are given as -99999)
Lat and Lon :  latitudes and longitudes given in decimal format with S and W as negative.
Elev : elevation given in meters above sea level
Start YrMo, End YrMo : starting and ending years/months (YYYYMM) of station record.
Tot. Poss. Meas. Days :  indicates the total number of days between the start and end dates of the station record.
NUM Missing :  number of days when a ‘missing’ was recorded – NOT the same as the number of days in data gaps.
Most Consec. Years No Meas. :  indicates the largest data gap in years
AVG, MED, MAX, STD of Daily Rain :  these statistics given in 10ths of mm (MED=median, STD=standard deviation).
% Zero and Trace :  percentage of zeros and trace in the set of daily rainfall records (does not include the number of days from data gaps).
Gamma PDF – Alpha : gamma distribution shape parameter (dimensionless; fitted to amounts in 10ths of mm).
Gamma PDF – Beta : gamma distribution scale parameter (in 10ths of mm).

3)  large_amts98.xls   download

For each station, daily rainfall amounts with a cumulative gamma probability greater than or equal to 0.98 were selected (gamma distributions were figured separately for each station).  Each rainfall is given in 10ths of mm, along with the year, month, and day.  In addition, the following fields are given :

Num. Days Accum. :  some stations report rainfall as accumulated and this figure gives the number of prior days which make up the accumulated amount (0 indicates that the amount is not an accumulated value).
Num. of Prec. Zeros/Trace/Missing :  the number of zeros/trace/missing which precede the large rainfall.  This statistic is useful because some stations were found not to flag values as accumulated and instead to fill in zero rainfall amounts between the accumulated values.
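
The selection described in #3 can be sketched as follows.  This is only an illustration under stated assumptions, not the procedure actually used to build the file: the gamma parameters are estimated by the method of moments from the non-zero amounts (the fitting method is not stated here), missing values are encoded as None, zero and trace are both encoded as 0, and all function names are hypothetical.

```python
import math

def fit_gamma_moments(amounts):
    """Method-of-moments gamma fit (an assumption); returns (shape, scale)."""
    n = len(amounts)
    mean = sum(amounts) / n
    var = sum((x - mean) ** 2 for x in amounts) / (n - 1)
    return mean * mean / var, var / mean

def gamma_cdf(x, alpha, beta, max_terms=1000):
    """Cumulative gamma probability, using the power series for the
    regularized lower incomplete gamma function P(alpha, x / beta)."""
    if x <= 0:
        return 0.0
    z = x / beta
    term = 1.0 / alpha
    total = term
    for n in range(1, max_terms):
        term *= z / (alpha + n)
        total += term
        if term < 1e-16 * total:
            break
    log_prefactor = alpha * math.log(z) - z - math.lgamma(alpha)
    return min(1.0, total * math.exp(log_prefactor))

def count_preceding_dry(records, i):
    """Consecutive zero/trace/missing records immediately before index i."""
    count = 0
    j = i - 1
    while j >= 0 and (records[j] is None or records[j] == 0):
        count += 1
        j -= 1
    return count

def flag_large_amounts(records, prob=0.98, min_dry=0):
    """Return (index, amount) for amounts whose cumulative gamma probability
    is >= prob and which follow at least min_dry zero/trace/missing days."""
    wet = [r for r in records if r is not None and r > 0]
    alpha, beta = fit_gamma_moments(wet)
    flagged = []
    for i, r in enumerate(records):
        if r is None or r <= 0:
            continue
        if gamma_cdf(r, alpha, beta) >= prob and count_preceding_dry(records, i) >= min_dry:
            flagged.append((i, r))
    return flagged
```

Calling flag_large_amounts(records) corresponds to the selection in #3, and flag_large_amounts(records, min_dry=5) to the stricter selection in #4.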

4)  large_amts98_5.xls    download

Same as #3 above, except that only those values from #3 which have at least 5 preceding zeros/trace/missing (and are therefore not reported as accumulated values) are selected.  The threshold of at least 5 preceding zeros/trace/missing was chosen to increase the likelihood that the rainfall is in fact an unflagged accumulated value.  These values are candidates for the error pattern described at the bottom of #3.  NOTE: the zero values which precede these large amounts are potentially suspicious as well.  Other fields that have been added are:

Trop. Storm Day? (1=y, 0=n) :  indicates whether there was a tropical storm in the vicinity on that day, where the vicinity is defined by the tropical storm model as less than 480 km.  Note that the model rainfall rates at distances of 400 km to 480 km are on the order of 0.1 mm/hr for depressions and 0.8 mm/hr for typhoons.
If Trop. Storm Day, amt. rain exp. (mm) :  a hurricane model (described in hurr_NH_all.xls below) along with a database of Pacific tropical storms was used to determine estimates of expected rainfall during tropical storm events given the storm’s intensity and proximity.
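
As an illustration of the "Trop. Storm Day?" flag only, the proximity check could look like the sketch below.  It assumes great-circle (haversine) distance on a spherical Earth of radius 6371 km, and a list of best-track positions for the day in question; how distance was actually measured is not stated here, and the function names are hypothetical.  The rainfall model itself is not reproduced.

```python
import math

EARTH_RADIUS_KM = 6371.0   # assumed mean Earth radius
VICINITY_KM = 480.0        # vicinity limit given by the tropical storm model

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km; S and W are negative, as in station_stats.xls."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_trop_storm_day(station, track_fixes):
    """Return 1 if any storm position (lat, lon) on that day lies within
    480 km of the station (lat, lon), else 0."""
    lat, lon = station
    return int(any(great_circle_km(lat, lon, s_lat, s_lon) < VICINITY_KM
                   for s_lat, s_lon in track_fixes))
```

The haversine form handles positions on either side of the date line (for example 179 and -179 degrees longitude) without special treatment, which matters for Pacific stations.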

5)  large_amts10.xls    download

Daily rainfalls greater than 10 inches (for description of the tropical storm fields, see #4 above).

6)  large_amts_accum.xls        download

A small table showing accumulated rainfalls greater than or equal to 10 inches with a maximum of 3 days prior making up the accumulation.
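
Because the amounts in these files are stored in 10ths of mm while the thresholds in #5 and #6 are stated in inches, a conversion is needed before filtering.  The following minimal sketch shows the #6 selection, assuming each row is a dictionary with hypothetical keys "amount" (10ths of mm) and "num_days_accum" (the "Num. Days Accum." field).

```python
TENTHS_MM_PER_INCH = 254  # 1 inch = 25.4 mm = 254 tenths of a mm

def inches_to_tenths_mm(inches):
    """Convert inches to the database unit (10ths of mm)."""
    return inches * TENTHS_MM_PER_INCH

def large_accumulations(rows, min_inches=10, max_prior_days=3):
    """Accumulated amounts >= min_inches made up of at most max_prior_days prior days."""
    threshold = inches_to_tenths_mm(min_inches)
    return [row for row in rows
            if row["amount"] >= threshold
            and 1 <= row["num_days_accum"] <= max_prior_days]
```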

7-11)  statsYYtoYY.xls       download

The purpose of creating these files was to map and compare daily rainfall statistics over 5-year periods.  Only atolls, raised atolls, and low islands were considered.  Since many stations do not have continuous data, only stations with a minimum of 70% data during the period in question were used (to ensure a good degree of time overlap).  The mapping analysis did show some (perhaps) significant differences; however, more work would need to be done for more conclusive results.
 The rainfall amounts were given in 10ths of mm.  The mean and standard deviation without zero/trace amounts as well as the standard deviation of all data were computed for each station.  Also included were the lag-1, 2, and 3 autocorrelations.
 Maps of these statistics are available upon request.
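
The per-period statistics described above can be sketched as follows.  This is a minimal illustration, assuming missing days are encoded as None, zero and trace are both encoded as 0, sample (n-1) standard deviations are used, and autocorrelations are computed only over pairs of days that both have data; none of these details is stated above, and the function names are hypothetical.

```python
import statistics

def has_enough_data(records, min_fraction=0.70):
    """True if at least min_fraction of the days in the period have data."""
    present = sum(1 for r in records if r is not None)
    return present / len(records) >= min_fraction

def lag_autocorrelation(records, lag):
    """Lag-k autocorrelation using only pairs where both days have data."""
    present = [r for r in records if r is not None]
    mean = statistics.fmean(present)
    denom = sum((r - mean) ** 2 for r in present)
    num = sum((a - mean) * (b - mean)
              for a, b in zip(records, records[lag:])
              if a is not None and b is not None)
    return num / denom

def period_stats(records):
    """Statistics for one station over one period (amounts in 10ths of mm)."""
    present = [r for r in records if r is not None]
    wet = [r for r in present if r > 0]
    return {
        "mean_wet": statistics.fmean(wet),       # mean without zero/trace
        "std_wet": statistics.stdev(wet),        # std. dev. without zero/trace
        "std_all": statistics.stdev(present),    # std. dev. of all data
        "lag1": lag_autocorrelation(records, 1),
        "lag2": lag_autocorrelation(records, 2),
        "lag3": lag_autocorrelation(records, 3),
    }
```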

12)  dow_table.xls       download

A useful table which takes into account the human element of recording data on a daily basis.  For each station and for each year, average rainfall and percentage of zero/trace were computed for each day of the week (Mon, Tues, etc. – hence the name “dow” which stands for “day of the week”).  For each station and year, the table shows the sample size for each day of the week.  Error bars for each average and zero percentage were computed as follows :

for the average :  +/- 1.5σ/sqrt(N)

where σ is the sample standard deviation, and N is the sample length (where the sample is the set of values for a particular day of the week).

for zero percentage :  +/- 1.5sqrt[ p(1- p)/N ]

where p is the zero percentage, and N is the sample length.

For both the average and zero percentage, values were compared (among the days of the week for each year) and whenever the difference was large enough such that there was no overlap of error bars, a significant difference was defined.  In the table, a “1” indicates there is a significant difference (“0” indicates no significant difference).
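
The error bars and the no-overlap test above can be written out as a short sketch.  The function names are hypothetical, and p is assumed to be expressed as a fraction between 0 and 1 (as in the station 14000 example table), which the formula p(1-p)/N requires.

```python
import math
import statistics

def mean_error_bar(values, k=1.5):
    """Half-width of the error bar on the mean: k * s / sqrt(N)."""
    return k * statistics.stdev(values) / math.sqrt(len(values))

def zero_fraction_error_bar(p, n, k=1.5):
    """Half-width of the error bar on the zero fraction: k * sqrt(p(1-p)/N)."""
    return k * math.sqrt(p * (1.0 - p) / n)

def significantly_different(center_a, half_a, center_b, half_b):
    """Return 1 if the two error bars do not overlap, else 0."""
    return int(abs(center_a - center_b) > half_a + half_b)
```

For illustration only, using the whole-record figures for station 14000 shown at the top of this page (the actual table works year by year): Sunday has p = 0.72 with N = 759 and Monday has p = 0.26 with N = 831.  Both half-widths are below 0.03, so the bars do not overlap and the difference would be marked with a 1.  Tuesday (0.41, N = 828) and Friday (0.43, N = 820) overlap and would be marked with a 0.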
 

 13 –14) hurr_(“NH” or “SH”)_all.xls       download

A database of all Pacific tropical storms (ranging in intensity from depression to typhoon) for both hemispheres for the years 1971-1997 was used.  The tropical storm database was obtained from data available to the public at:  1) Unisys weather page, http://weather.unisys.com, and 2) JTWC’s Tropical Cyclone Best Track Data Site, http://www.npmoc.navy.mil/products/jtwc/best_tracks/index.html.  The tropical storm locations and intensities (highest sustained winds) in the database are given every six hours (“best track”).
The goal of using this tropical storm data was to get an approximation of how much rain to expect at a station given a storm’s intensity and proximity.  A rainfall rate model for northern hemisphere western Pacific tropical storms (ranging from depression to typhoon) was used from Adler et al., Monthly Weather Review, 1981, #109, p. 506-521.  This model used satellite estimates of rainfall.
The intensities of many of the SH tropical storms from the 1970s were not known; since a value was needed in order to compute an expected rainfall, 35 kts was arbitrarily chosen.  In addition, some station measuring times are not known, in which case a measuring time of noon local time was arbitrarily chosen.
A Normal distribution was fitted to the set of expected minus reported rains and cumulative Normal probabilities were given for each such difference (last field in the table).

15)  hurr_susp_zeros.xls      download

From the tropical storm files described above (#13-14), all zero rainfalls with cumulative Normal probabilities of expected minus reported rainfall greater than or equal to 0.75 were selected.  These are zero rainfalls on significant tropical storm days.  If the severity of the storm prevented rainfall measurement, then a “missing” instead of a “zero” should have been reported.  Zero rainfalls were specifically selected for two reasons.  First, they fit the error pattern described above in #3, so any following zeros, as well as the first non-zero value after the zero from a tropical storm day, may be viewed with suspicion.  Second, we do not know the accuracy of the tropical storm model.  With the error pattern described in #3, it is also possible to get large negative differences in expected minus reported rainfall; this is taken into account in the final_scores.xls file.
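
The selection of suspicious zeros can be sketched as follows.  This is an illustration only: it assumes the Normal distribution is fitted with the sample mean and sample standard deviation of the differences (the fitting details are not stated here), and the function names are hypothetical.

```python
from statistics import NormalDist

def cumulative_normal_probs(expected, reported):
    """Fit a Normal distribution to (expected - reported) and return the
    cumulative probability of each difference."""
    diffs = [e - r for e, r in zip(expected, reported)]
    dist = NormalDist.from_samples(diffs)
    return [dist.cdf(d) for d in diffs]

def suspicious_zero_indices(expected, reported, min_prob=0.75):
    """Indices of tropical storm days with a reported zero whose
    expected-minus-reported difference has cumulative probability >= min_prob."""
    probs = cumulative_normal_probs(expected, reported)
    return [i for i, (r, p) in enumerate(zip(reported, probs))
            if r == 0 and p >= min_prob]
```

A zero reported on a day with little expected rain gives a small difference and a low cumulative probability, so it is not selected; only zeros on days with large expected rain pass the 0.75 threshold.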

16)  final_scores.xls        download

Using a few quality control tests, “reliability” scores for stations with enough data can be computed.  This is a subjective process which highlights the potential reliability of a station’s rainfall record.  The three QCs used were: QC1-rainfall per day of the week statistics (see dow_table.xls above), QC2-gamma distribution statistics (see large_amts98_5.xls above), and QC3-rainfall statistics during tropical storm events (see #s 13-15 above).  For QC1, only stations having at least 5 years with each year having a minimum of 140 records were scored.  For QC2, only stations with at least 30 rainfalls (over the whole station record) having cumulative gamma probability greater than or equal to 0.98 were scored.  For QC3, only stations with at least 30 total tropical storm days (over the whole station record) were scored.
For each test, certain ratios were calculated (“raw” score) for each station and a standardized anomaly was calculated for each ratio (the z-score standardization method was used).  For example, one of the QC1 tests is the ratio of the number of years with no significant difference in zero percentage to the total number of years (only years with a minimum of 140 records were considered).  This ratio was found for each station and then standardized (“S.A.” columns in the final_scores.xls table stand for “standardized anomaly”).  NOTE: the QC3 score is the result of two tests, with the first QC3 test’s S.A. receiving double weight since the accuracy of the tropical storm model is not known.  The second QC3 test defines significant differences between expected and reported rains as those which have cumulative Normal probabilities in the outer 10% of each tail of the Normal distribution.
The final score for each station is the weighted average of the QC scores with the following weights: QC1=4, QC2=1, QC3=1.  The weight for QC1 was arbitrarily chosen to be 4 since it is the most conclusive test.  When fewer than 3 of the QC scores were known, the final score was either the average of the known scores (keeping the same weights as above) or just the single known QC score.  For a station to receive a final score, it had to have at least 5 years with a minimum of 140 records each.
For stations which did not receive a final score due to there being too little data, a “NaN” (“not a number”, kept this way for convenient reading into a numerical matrix) was indicated.  NOTE:  This is a subjective scoring system and it must be emphasized that stations with a relatively low score are not necessarily unreliable.  Also, if you are planning to use data from a particular station which has a relatively high score, it is still a good idea to examine the QC files listed above.  For example, while a station may have a decent score, it may be found in the file dow_table.xls that there is clearly a suspicious rainfall pattern over a 3-year period out of its 20-year record.
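
The scoring arithmetic described above can be sketched as follows.  This is an illustration under stated assumptions, not the code used to produce final_scores.xls: the z-scores use the sample (n-1) standard deviation, an unknown QC score is passed as None, and a missing QC1 is taken to mean no final score, since the final-score data requirement is the same as the QC1 requirement.  The function names are hypothetical.

```python
import math
import statistics

WEIGHTS = {"QC1": 4.0, "QC2": 1.0, "QC3": 1.0}

def standardized_anomalies(raw_scores):
    """z-score standardization of raw ratios; maps station id to (x - mean) / s."""
    values = list(raw_scores.values())
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return {station: (x - mean) / sd for station, x in raw_scores.items()}

def final_score(qc_scores):
    """Weighted average of the known QC scores; NaN if QC1 is unknown."""
    known = {name: s for name, s in qc_scores.items() if s is not None}
    if "QC1" not in known:
        return math.nan
    total_weight = sum(WEIGHTS[name] for name in known)
    return sum(WEIGHTS[name] * s for name, s in known.items()) / total_weight
```

For example, QC scores of 1.0, -0.5 and 0.5 give (4*1.0 - 0.5 + 0.5)/6, or about 0.67, while a station with only QC1 known keeps its QC1 score as the final score.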