Developing a data warehouse solution for genomics and proteomics (Marshall Peterson, CTO, J. Craig Venter Institute)
1) Petabyte
New challenges:
1) computational molecular biology
2) medicine
3) ecosystems
4) scale
5) locality
Requirements:
1) data management: capture, storage, archival, distribution, analysis
2) data movement
3) scheduling computation
- what living organisms are floating in the air
- Microbial abundance
- 4 to 6 x10^30 microbes
- Sargasso Sea: decided to sample the sea water. It was picked because it is very low in nutrients, so they didn't expect to see a lot there.
- Sargasso Sea water
- 100 million letters of genetic code every 24 hours
- 1800 new species
- 1.2 million new genes and 12 complete genomes.
- before that, about 120k-180k genes were known, so this added 1.2 million to the total
- Proteorhodopsin
- take all the genes which might be efficient in producing ethanol....
- world sampling route: Galápagos.
- birds are pink because they eat pink shrimp. (remember: you are what you eat :))
- Sargasso Sea data: generated 1 billion base pairs of DNA
- 5 million more genes
- 85% of sequence was only seen once.
- arguably:
- 5 million new genes: lots of computational challenges.
- sample at various depths.
- were accused of biopiracy
- each region is very different.
- can we use this data to learn how life evolved?
- The much larger problem is making sense of it.
- use these new sequences, genes and gene families, together with the associated environmental data, to better understand the functioning of natural ecosystems.
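To make the "85% of sequence was only seen once" note concrete, here is a minimal sketch of how such a fraction could be measured. The talk did not say how the figure was computed; this version assumes exact string matching on toy reads, whereas a real pipeline would compare sequences by similarity. The function name and the sample reads are made up for illustration.

```python
from collections import Counter


def singleton_fraction(sequences):
    """Return the fraction of distinct sequences that occur exactly once."""
    counts = Counter(sequences)
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Toy reads, invented for illustration only (not real sample data).
reads = ["ACGT", "ACGT", "TTGA", "GGCC", "CATG"]
fraction = singleton_fraction(reads)
print(fraction)  # 3 of the 4 distinct reads appear once -> 0.75
```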
Tutorial PART 2:
Bill Blake: Netezza Corporation
- problems:
a) data unreadable b) unstructured c) unintegrated
- approaches to data integration:
a) link integration
b) view integration
c) data warehousing
What is a data warehouse:
a) Large collections of tables
b) 470 TB data
c) Large multi terabyte databases
d) databases are huge.
e) Databases are persistent.
f) Disk storage is < $1 per GB; will be 10c per GB by the end of this decade
g) Larger and larger symmetric multiprocessors
h) A number of network data movement problems
i) Netezza does this only for analysis.
j) You do need to look at all of the data, e.g. in wireless technologies.
k) Databases grew up on mainframes because of the need to keep data in proximity to the processing
l) Memory grew larger
m) More power was added by clustering.
n) Active disk technology: decision support algorithms are offloaded to active disks to support key decision support tasks.
o) Take storage and move intelligence very close to it.
p) FPGA: have it contain the disk controller logic. Create a datapath for records coming along each path so that SQL operations can be applied for each query, e.g. projecting only certain columns of the table.
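The idea in the last few items (filter and project next to the disk, ship only what the query needs) can be sketched in software. This is only an illustration of the data flow, not Netezza's FPGA logic: the function name, the table layout and the sample rows are all assumptions made up for the example.

```python
def storage_side_scan(records, columns, predicate):
    """Filter rows and project columns next to the storage.

    Only the small projected tuples are yielded to the host,
    instead of every full record.
    """
    for rec in records:
        if predicate(rec):
            yield tuple(rec[c] for c in columns)


# Hypothetical rows as they might sit in one disk block.
disk_block = [
    {"id": 1, "organism": "A", "length": 900, "sequence": "ACGT" * 3},
    {"id": 2, "organism": "B", "length": 120, "sequence": "TTGA" * 3},
    {"id": 3, "organism": "A", "length": 450, "sequence": "GGCC" * 3},
]

# Roughly: SELECT id, length FROM genes WHERE length > 400
result = list(
    storage_side_scan(disk_block, ["id", "length"], lambda r: r["length"] > 400)
)
print(result)  # [(1, 900), (3, 450)]
```

The bulky sequence column never leaves the storage side here, which is the point of doing the projection early.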
Asymmetric Massively Parallel Processing (tm)
- Cluster and SMP machines are multi-user. Entire compute cluster.
- If your goal is to have a relational database resource.
- Pure message passing, shared nothing. Oracle is focused on OLTP, but here it's about deep analysis.
- Merged join back of 100s of individual subquery operations. Aggregation is done by the communication pattern. Fewer intermediate operations.
- Take an application and let the software define the hardware architecture.
- More intrigued by where the game people are going: keep the power down and the parallelism high.
- Cost of 4 or 5 terabyte database.
- FPGA: on each SPU, the FPGA
- How do queries like top 10 BLAST hits work?
- Added BLAST capability in SQL.
- The gain here comes from not having to do all the preprocessing
- Not very well integrated.
- Easy data management
- moving processing power as close as possible to the data on disk supports analyzing all the data all the time at terascale levels
- data types tailored to the nucleotide and protein data supported both storage efficiency and effective analysis
- the largest payoff appears to be the ability to perform complex ad hoc queries involving search and sequence analysis on an integrated single copy of data.
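The shared-nothing points above (each unit works on its own slice, the results are merged back) can be sketched for a "top 10 hits" style query. This is an illustration of the general pattern only, not the vendor's implementation; the function names, the partitions and the scores are invented for the example.

```python
import heapq


def local_top_k(partition, k):
    """Each node ranks only its own slice of the data."""
    return heapq.nlargest(k, partition, key=lambda hit: hit[1])


def merge_top_k(partials, k):
    """The host merges the small per-node results into the global answer."""
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged, key=lambda hit: hit[1])


# Hypothetical (sequence_id, score) pairs, one list per node.
partitions = [
    [("s1", 50), ("s2", 12), ("s3", 97)],
    [("s4", 88), ("s5", 5)],
    [("s6", 64), ("s7", 91), ("s8", 30)],
]

k = 3
partials = [local_top_k(p, k) for p in partitions]
top_hits = merge_top_k(partials, k)
print(top_hits)  # [('s3', 97), ('s7', 91), ('s4', 88)]
```

Each node sends at most k rows to the host, so the amount of data moved stays small no matter how large the partitions are.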
I liked the idea of getting processing power to the data instead of getting data to the processors.