Management of filariasis using prediction rules derived from data mining.

The present paper demonstrates the application of CART (classification and regression trees) to the control of a mosquito vector (Culex quinquefasciatus) of bancroftian filariasis in India. A database on filariasis and the commercially available software CART (Salford Systems Inc., USA) were used in this study. Baseline entomological data related to bancroftian filariasis was utilized for deriving prediction rules. The data was grouped into three categories, namely (1) mosquito abundance, (2) meteorological and (3) socio-economic details. This data was taken from a database developed for a project entitled "Database management system for the control of bancroftian filariasis" sponsored by the Ministry of Communication and Information Technology (MC&IT), Government of India, New Delhi. Predictor variables (maximum temperature, minimum temperature, rainfall, relative humidity, wind speed, house type) were ranked by CART according to their influence on the target variable (month). The approach is useful for forecasting vector (mosquito) densities in forthcoming seasons.


Background:
Public health management requires an understanding of disease transmission, vector control and disease morbidity. Bancroftian filariasis is a mosquito-borne disease, infecting nearly 60 million people in South East Asian countries. The annual economic loss due to filariasis in India alone is US$1.5 billion. [1-3] The tropical and sub-tropical climate facilitates the proliferation of the mosquito vector (Culex quinquefasciatus) of filariasis. [4,5] This mosquito-borne disease remains a threat to human populations despite the practice of several control strategies. [6] Proper planning and implementation of control measures require adequate exploitation of the available data for disease management. Therefore, it is of interest to develop prediction methods to augment existing mosquito control strategies. Murty et al. used rule-based systems for rapid and accurate identification of 54 malaria-causing Indian Anopheline mosquito species. [7] The value of prediction models in disease management has thus been recognized. [8-10] These tools help epidemiologists to predict the future course of vector-borne diseases. Here, we derive decision rules for vector surveillance using CART (classification and regression trees).

Data formats:
The raw data was stored in Excel and the analysis was performed using the commercial software CART (Salford Systems Inc., USA). Hence, the raw data was converted to a CART-compatible CSV (comma-delimited) format.
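The conversion to a comma-delimited format can be illustrated with a minimal Python sketch. The column names, house-type labels and numeric values below are hypothetical placeholders chosen for illustration; they are not taken from the project database, and the sketch does not reproduce the export procedure actually used.

```python
import csv
import io

# Hypothetical column names (assumed for illustration only).
FIELDS = ["month", "pmhd", "max_temp_c", "min_temp_c",
          "rainfall_mm", "relative_humidity_pct", "wind_speed", "house_type"]

# Hypothetical records standing in for rows of the spreadsheet.
records = [
    {"month": "April", "pmhd": 2.1, "max_temp_c": 39.8, "min_temp_c": 26.0,
     "rainfall_mm": 12.0, "relative_humidity_pct": 48.0, "wind_speed": 6.5,
     "house_type": "tiled"},
    {"month": "October", "pmhd": 35.0, "max_temp_c": 31.2, "min_temp_c": 22.4,
     "rainfall_mm": 130.0, "relative_humidity_pct": 78.0, "wind_speed": 4.0,
     "house_type": "thatched"},
]

def to_csv(rows, fields):
    """Serialize a list of dict records to comma-delimited text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(records, FIELDS)
```

The resulting text has one header line followed by one line per record, which is the general layout a tree-building tool expects when a CSV file is loaded.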
The data mining tool CART version 5.0 from Salford Systems, California, USA, was used for the current analysis. [11]
CART is a robust and powerful tree-based tool for data classification. [12] The tool is suited for the analysis of categorical (classification) and continuous (regression) datasets. The tool uses binary recursive partitioning, in which each parent node is split into exactly two child nodes, and the process is repeated on each child node until the tree is terminated. Termination depends on the rules used for splitting each node. In this process, each terminal node is assigned to a class outcome. CART contains sound statistical tools that enable the development of fast and accurate models. The steps used in the analysis are summarized as follows. The CSV formatted data is loaded into CART using the user interface. The loaded data is used to select and define the independent (predictor) variables and the target (dependent) variable. In this analysis, we defined month as the target and the other seven variables as predictors. The GINI splitting function is used to maximize the average purity of the two child nodes. [12] CART offers two tree types, namely (1) classification and (2) regression. The target variable (month) is categorical in this analysis. Hence, we used the classification tree model.
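The Gini criterion for a single binary split can be illustrated with a minimal Python sketch. This is not the Salford CART implementation; it only shows the general idea of choosing the threshold that minimizes the size-weighted Gini impurity of the two child nodes. The function names and the toy temperature values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold t for the split 'value <= t' with the lowest weighted impurity.

    Returns (threshold, weighted_impurity); threshold is None if no split
    improves on the impurity of the unsplit node.
    """
    n = len(labels)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Hypothetical toy data: maximum temperature (deg C) and month label.
max_temp = [38.0, 39.5, 40.0, 41.0, 42.5, 43.0]
month = ["April", "April", "April", "May", "May", "May"]

threshold, impurity = best_split(max_temp, month)
```

In a full tree, the same search would be repeated over every predictor at every node, and then recursively on each child node until a stopping rule is met.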

Results:
The CART analysis generated a decision tree with 133 terminal nodes based on the selection criteria. Every terminal node represents a decision rule. Out of the 133 terminal nodes, 17 decision rules were in agreement with meteorological and socio-economic parameters. The decision rules (IF-THEN) used in this analysis are given in Table 1. Table 1 shows the distribution of Culex quinquefasciatus density, expressed as per man-hour density (PMHD) and ranging from ≤ 2.42 to 84, over the months of a year. A very low PMHD of ≤ 2.42 is reported for rules #3 and #5 in Table 1. These values correspond to the summer months April and May. This observation corresponds to high maximum temperature (≤ 40.15 °C in April and > 40.15 °C in May) during these months. Thus, high temperature is an influencing parameter for low PMHD in April and May. However, the PMHD is > 20.75 in April when relative humidity (> 54 %) and rainfall (≤ 142 mm) are high. Interestingly, PMHD is notably high during the monsoon and post-monsoon months (June, August, September, October, November, December, January and February).
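The form of such IF-THEN rules can be illustrated with a minimal Python sketch. It encodes only a simplified reading of rules #3 and #5 as described in the text (low PMHD combined with the 40.15 °C maximum temperature threshold); the actual rules in Table 1 may involve further conditions, and the function name is hypothetical.

```python
def classify_month(pmhd, max_temp_c):
    """Simplified illustration of two IF-THEN rules; returns None when no rule fires."""
    # Simplified reading of rule #3: low density, maximum temperature at or below 40.15 deg C.
    if pmhd <= 2.42 and max_temp_c <= 40.15:
        return "April"
    # Simplified reading of rule #5: low density, maximum temperature above 40.15 deg C.
    if pmhd <= 2.42 and max_temp_c > 40.15:
        return "May"
    # The remaining rules are not encoded in this sketch.
    return None
```

Each terminal node of the tree corresponds to one such conjunction of threshold conditions along the path from the root.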

Discussion:
Disease transmission dynamics are modeled using parameters such as vector (pathogen transmitting agent) surveillance, parasitic load in the human community and sudden environmental changes. [6] We used data mining tools in CART to find relationships between vector data and the target variable. These relationships are generally hidden in a large dataset. The IF-THEN rules generated by CART can be used to predict filarial transmission vectors effectively.
The PMHD recorded during the summer months for rules #3 and #5 shows that there is no risk of filariasis when the role of other influencing parameters is negligible. In Table 1, for rule #4, the PMHD is high due to high relative humidity and total rainfall. This results in an increased risk of disease transmission under these conditions in April.
During the months of October, November and December, a high PMHD (> 29.2 to ≤ 84) is recorded for different house types (rules #11, #13, #14, #15 and #16). These rules suggest that relative humidity is a critical variable for vector density. For rules #1, #2, #7 and #17, the PMHD is elevated due to high total rainfall. Table 1 shows that the five predictors, namely (1) total rainfall, (2) maximum temperature, (3) minimum temperature, (4) relative humidity and (5) wind speed, influenced the target variable in descending order. This is helpful in ranking the predictor variables. Thus, decision trees play an important role in the management of vector-borne diseases.

Conclusion:
The principal vector of bancroftian filariasis is the mosquito Culex quinquefasciatus. Surveillance of the filariasis vector is an important issue in disease management. Here, we show that decision rules help to predict and forecast mosquito density during different months of a year in the region. Thus, prediction of vector density is important for the effective control of vector-borne diseases.