Detection of Emerging and Re-Emerging Infectious Diseases (EID) addresses diseases whose incidence in humans has continued to increase over the past two decades and threatens to keep increasing in the near future. EID also includes new or previously unrecognized diseases, diseases that are spreading to new geographic areas and hosts, and diseases that are re-emerging, such as drug-resistant tuberculosis (TB).

As public health organizations begin to address the demands of screening, diagnosing, and treating these infectious diseases, we will see rapid growth in demand-side “need” for more flexible, time-efficient, and cost-effective diagnostics for all human conditions.

From this perspective, point-of-care (POC) diagnostics can have the greatest impact in the field and in community-based clinics, where the vast majority of the populations in need can be found. Testing at the point of care is also necessary both for prompt diagnosis and for providing health services evenly throughout the population, including in rural districts. These requirements can only be fulfilled by technologies such as real-time POC sensors, which offer the ability to screen for or diagnose diseases in a fast, simple, safe, and reliable manner, and with improved sensitivity.

Portable and rugged construction, diminishing sample size requirements, the use of non-invasive samples (such as breath or saliva), and reduced power consumption make this new generation of POC sensors attractive for use in settings not typically hospitable to medical technologies. Further, many of these new technologies are based on inorganic materials and on non-invasively obtained samples, which lowers the overall cost of testing through decreased production costs, lower rates of ‘spoilage’, and longer shelf lives.

Integrating these POC sensors with wireless or cellular networks allows an immediacy of data access never before available. The effectiveness of these POC sensors will be realized only when we learn to use the data they gather in the field to understand all of the geographic, social, behavioral, and environmental factors, and the host of other human activities, that often accelerate and amplify the emergence and spread of disease.

A review of all infectious diseases of clinical significance revealed a significant potential public health benefit in mapping the real-time distribution, condition, and treatment efficacy of infectious diseases.

The lack of timely and relevant geographical information often frustrates a variety of clinical, epidemiological, and public health aspirations, particularly in areas affected by epidemic-prone diseases.

In India in particular, the gulf between retrospective data, which is often months or even years old, and real-time data detailing the spatial distribution of a disease has serious implications for public health surveillance. Historically, little attention has been given to spatial epidemiology in regional and national preparedness planning, or to disease treatments and the successes or failures of management practices, and it is almost impossible to effectively correlate non-health-related information with diagnostics data. As a result, it is difficult or even impossible to gauge the risk posed by new infectious disease outbreaks, because we have only the crudest understanding of their natural geographical range, their temporal development, and the factors that might affect treatment logistics, decision-making, or disease progression.

A significant barrier exists, however, to applying existing analytics tools to POC sensor data and to many of the potentially related data sources, because these data are generally unstructured and constantly changing. Almost all of these analytics tools require a clean, multidimensional data warehouse from which to operate. Cleansing and dimensionally modeling data currently requires tremendous cost, skill, and time. For example, an average data warehouse requires eight or more months to construct and involves teams of source-data experts and subject matter experts with a clear understanding of the expected analytic outcomes. Further, once a data warehouse is constructed, regular servicing of its structure and processing is necessary to prevent decay of the data model.

We use a Virtual Data Warehouse (VDW) to address the problem of cleansing and dimensionalizing heterogeneous source data through a fundamentally different approach specifically adapted to POC sensor data. The common method and tooling used in data warehouse construction today is known as ETL (Extract, Transform, Load). For over ten years, ETL has given businesses and researchers the ability to copy source data from various repositories and origins while cleansing, transforming, and modeling the data in multidimensional structures so that meaningful analysis can be performed.

ETL methods, however, account for over 75% of the total cost of building and maintaining data warehouses.

EFT Data Warehouse

Our approach dramatically reduces the amount of time, expense and subject matter expertise required in the construction and maintenance of data warehouses through an evolutionary paradigm involving the following four concepts:

  • Index: All original heterogeneous data sources are indexed in place (the current ETL method requires all modeled data to be copied, inherently establishing an N+1 data availability limitation).
  • Discoverable Dimensions: Dimensions are automatically discovered through machine learning techniques and rendered visually, giving a business user the ability to visually refine and further define the dimensions required for analysis (with the current ETL method, all dimensions have to be specifically determined and declared by the ETL team).
  • Intelligent Aggregations: Aggregations are identified and specified visually by a user through dimensional modeling and exploration, then processed and stored with the index (ensuring rapid and efficient data warehousing operations such as slicing, dicing, and drilling).
  • Emulation: A standard interface to the index emulates a traditional data warehouse, allowing any analytic tool to function fully as if the “virtual data warehouse” were physically stored and managed using current techniques.
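
To make the four concepts more concrete, the following is a minimal Python sketch of how they could fit together. It is an illustration under stated assumptions, not the actual VDW implementation: the class name `VirtualWarehouse`, its methods, the field names, and the sample records are all hypothetical, and a simple "few distinct text values" rule stands in for the machine-learning dimension discovery described above.

```python
from collections import defaultdict


class VirtualWarehouse:
    """Toy illustration of the four concepts; every name here is hypothetical."""

    def __init__(self):
        self.sources = {}  # source name -> records, left where they are
        # field -> value -> list of (source name, position) references
        self.index = defaultdict(lambda: defaultdict(list))
        self.aggregates = {}  # (dimension, measure) -> {value: total}

    def add_source(self, name, records):
        # Index: record references to the rows instead of copying them.
        self.sources[name] = records
        for pos, rec in enumerate(records):
            for field, value in rec.items():
                self.index[field][value].append((name, pos))

    def discover_dimensions(self, max_distinct=10):
        # Stand-in for machine learning (an assumption of this sketch):
        # text fields with few distinct values are candidate dimensions.
        dims = []
        for field, values in self.index.items():
            if len(values) <= max_distinct and all(
                isinstance(v, str) for v in values
            ):
                dims.append(field)
        return sorted(dims)

    def aggregate(self, dimension, measure):
        # Intelligent aggregation: computed once, then stored with the index.
        key = (dimension, measure)
        if key not in self.aggregates:
            totals = defaultdict(float)
            for value, refs in self.index[dimension].items():
                for name, pos in refs:
                    totals[value] += self.sources[name][pos].get(measure, 0)
            self.aggregates[key] = dict(totals)
        return self.aggregates[key]

    def query(self, dimension, measure):
        # Emulation: one warehouse-style entry point for analytic tools.
        return sorted(self.aggregate(dimension, measure).items())


# Made-up sample data from two differently shaped sources.
clinic = [
    {"district": "Pune", "result": "positive", "tests": 1},
    {"district": "Pune", "result": "negative", "tests": 1},
    {"district": "Nagpur", "result": "positive", "tests": 1},
]
survey = [
    {"district": "Nagpur", "reports": 4},
    {"district": "Pune", "reports": 2},
]

vdw = VirtualWarehouse()
vdw.add_source("clinic", clinic)
vdw.add_source("survey", survey)
print(vdw.discover_dimensions())
print(vdw.query("district", "tests"))
```

In this sketch the two sources never have to share a schema: the index only records where each field value occurs, and the aggregation treats a missing measure as zero. A real system would need far more careful handling of data types, units, and conflicting field names.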


Data Mural integrates secondary use of local and regional passive search query and micro-blogging data, as well as actively collected crowd-sourced data for disease surveillance, into our mapping and data mining user interface. The success of these methods is well documented and validated for major public health events, including influenza and dengue epidemics, but as far as we know they have never been applied to TB.

The sheer volume of health-outcome-related searches and personal accounts creates significant new opportunities to monitor population health and correlated, epidemiologically relevant information in real time.

We use this information to build definitive extents and databases on the occurrence of many diseases, beginning with TB, with plans to extend the system based on our experience in India. The volume, velocity, and variety of occurrence information from these sources will increase rapidly and transform our ability to create geographical baselines for a range of diseases, including TB. Leveraging these large, real-time, crowd-sourced data will allow Data Mural to more fully consider the interconnections between health and socio-cultural factors, demographics, weather, and other factors, making it easier to achieve best practice and improved patient outcomes.

While integrating these disparate sources of information can be incredibly powerful in improving our understanding of disease, the data collected from these sources can be subject to confounders resulting from factors such as human behavior. For example, a recent assessment of the Google Flu Trends model found that, even with regular modifications to address flaws identified in prior years, the conclusions of these models can be affected by internet behavior that varies between age groups, seasons, geographies, and other factors not directly related to the disease of interest. Real-time disease data could be used to complement this crowd-sourced data and reduce the bias introduced by these sources.
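
As one hedged illustration of how real-time disease data might be used to check crowd-sourced signals, the Python sketch below compares each region's share of crowd-sourced mentions with its share of confirmed positive screenings and flags regions where the two diverge. This is an assumption made for illustration, not a description of Data Mural's method: the function name `flag_divergent_regions`, the tolerance value, the region names, and all counts are hypothetical.

```python
def flag_divergent_regions(crowd_counts, confirmed_counts, tolerance=0.5):
    """Flag regions whose crowd-sourced share departs from the confirmed share.

    Both arguments map region -> count. The returned dict maps each flagged
    region to the relative gap (crowd share minus confirmed share, divided
    by confirmed share), rounded to two decimals.
    """
    regions = sorted(crowd_counts.keys() & confirmed_counts.keys())
    crowd_total = sum(crowd_counts[r] for r in regions)
    confirmed_total = sum(confirmed_counts[r] for r in regions)
    flagged = {}
    for region in regions:
        if confirmed_counts[region] == 0:
            continue  # no confirmed cases to compare against in this sketch
        crowd_share = crowd_counts[region] / crowd_total
        confirmed_share = confirmed_counts[region] / confirmed_total
        relative_gap = (crowd_share - confirmed_share) / confirmed_share
        if abs(relative_gap) > tolerance:
            flagged[region] = round(relative_gap, 2)
    return flagged


# Made-up counts: search or micro-blog mentions vs. positive POC screenings.
crowd = {"north": 120, "central": 50, "south": 30}
confirmed = {"north": 30, "central": 25, "south": 45}
flagged = flag_divergent_regions(crowd, confirmed)
print(flagged)
```

A positive gap suggests a region is over-represented in the crowd-sourced signal relative to confirmed screenings, and a negative gap suggests it is under-represented. Either case marks a place where behavior, rather than disease, may be driving the crowd-sourced numbers.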

Data Mural is partnering with POC Sensor developers like Nanosynth Materials and Sensors to create a POC screening platform that will automatically collect and aggregate screening results with other relevant epidemiological information, providing the real-time disease data that is needed to overcome the biases that can occur through the use of crowd-sourced data alone.

By integrating individual screening results from the POC sensors in real time and correlating them with relevant social, demographic, environmental, and related data, we expect to give decision makers, policy makers, and healthcare providers the ability to better identify and develop effective and timely actions that save lives and keep their populations safer by reducing the spread and incidence of infectious diseases.