Very large data sets

Finally, we draw attention to the problem of visualization of these very large data sets. This is a problem that lies more in the domains of psychology and computer science, but one that could result in immediate benefits to biochemical... [Pg.9]

The Fortran version used in this study was located at the Computer Center at the University of Illinois at Urbana-Champaign. The Fortran version is useful for analysis of very large data sets, e.g., 400 x 70 matrices. The SIMCA-3B version for microcomputer systems is interactive and menu-driven, is applicable to intermediate-sized data sets, and runs under CP/M or MS-DOS. In this study, the SIMCA-3B program CPLS-2 was used to obtain the results in the PLS examples discussed. [Pg.226]

Visualizing large data sets is a powerful way to explore and analyze data. Dynamic visualization techniques to interactively query, explore, and analyze very large data sets are increasingly used across disciplines (Shneiderman, 2008). One of the most established interactive data visualization... [Pg.253]

The influence of the studies summarized above can be seen in the methods subsequently implemented by many other researchers for their applications (see the section on Chemical Applications). One method that was included in the original assessment studies, but not in the later assessments, is k-means. This method did not perform particularly well on the small data sets of the original studies, and the resultant clusters were found to be very dependent on the choice of initial seeds; hence, it was not included in the subsequent studies. However, k-means is computationally efficient enough to be of use for very large data sets. Indeed, over the last decade k-means and its variants have been studied extensively and developed for use in other disciplines. Because it is being increasingly used for chemical applications, any future comparisons of clustering methods should include k-means. [Pg.24]
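
The seed dependence mentioned in this excerpt can be illustrated with a minimal sketch of the basic k-means iteration (Lloyd's algorithm) on one-dimensional data. The function name `kmeans_1d`, the toy data, and the seed values are all assumptions chosen for illustration; they do not come from the cited studies.

```python
def kmeans_1d(points, seeds, max_iter=100):
    """Basic k-means (Lloyd's algorithm) on 1-D data, started from explicit seeds."""
    centers = list(seeds)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the nearest current center.
        new_labels = [
            min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers


# Hypothetical toy data: three tight groups, but only k = 2 clusters requested.
data = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

labels_a, centers_a = kmeans_1d(data, seeds=[0.0, 10.0])
labels_b, centers_b = kmeans_1d(data, seeds=[10.0, 20.0])

print(labels_a, centers_a)
print(labels_b, centers_b)
```

Both runs converge, but to different partitions: the first merges the two upper groups and the second merges the two lower groups. Each pass costs time proportional to the number of points times the number of clusters, which is why the method remains practical for very large data sets.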

With the very large data sets currently used in one-dimensional experiments (32 or 64 K), digital resolution is seldom a problem. Equation 2-3 also demonstrates that SR and DR are the same. Both are derived from eq. 2-1 in Section 2-4d. DR, however, provides the entry to zero filling and the recovery of lost data points (Section 2-5b). Again, for the previous SR example, with one level of zero filling to 65,536 total points, we have... [Pg.52]
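
As a rough sketch of what one level of zero filling does to the point count, the snippet below pads a record with zeros until its length doubles. The 32,768-point starting length is inferred from the excerpt's "65,536 total points" after one level. The spectral width, the placeholder signal values, and the simple "spectral width divided by number of points" measure are assumptions for illustration only; eq. 2-3 is not reproduced in this excerpt and may use a different convention.

```python
def zero_fill(signal, levels=1):
    """Append zeros so that the record length doubles once per level."""
    target = len(signal) * (2 ** levels)
    return list(signal) + [0.0] * (target - len(signal))


# Placeholder record with 32 K points; the sample values are arbitrary.
acquired = [1.0] * 32768
filled = zero_fill(acquired, levels=1)

# Hypothetical spectral width in Hz (not taken from the source).
spectral_width_hz = 5000.0

# Assumed convention: spacing between points = spectral width / number of points.
hz_per_point_before = spectral_width_hz / len(acquired)
hz_per_point_after = spectral_width_hz / len(filled)

print(len(acquired), len(filled))
print(hz_per_point_before, hz_per_point_after)
```

Under this assumed convention, each level of zero filling halves the spacing between points in the transformed spectrum without adding any newly measured signal.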

The effects of concentration. When metal adsorption at constant pH is plotted against solution concentration using a double logarithmic scale, linear relationships with slope less than unity are commonly obtained (Fig. 3). This was first noted by Benjamin and Leckie [25]. In their review of a very large data set, Dzombak and Morel [3] reported that this was a common observation for metal reaction with hydrous metal oxides. This behaviour is consistent with a model in which the reacting sites are not uniform; that is, there are a few sites of high affinity, rather more of slightly lower affinity, and so on. [Pg.831]
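
The link between non-uniform sites and a log-log slope below unity can be checked numerically. The sketch below assumes that each class of site follows a simple Langmuir-type term and sums the terms. The three site classes, their capacities, and their affinity constants are hypothetical values chosen only to mimic "a few sites of high affinity, rather more of slightly lower affinity"; they are not taken from the cited work.

```python
import math

# Hypothetical site classes as (capacity, affinity constant):
# few strong sites, more sites of progressively lower affinity.
SITES = [(1.0, 1e6), (10.0, 1e4), (100.0, 1e2)]


def adsorbed(conc):
    """Total adsorption as a sum of Langmuir-type terms, one per site class."""
    return sum(cap * k * conc / (1.0 + k * conc) for cap, k in SITES)


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


# Solution concentrations from 1e-7 to 1e-2 in arbitrary units.
concs = [10.0 ** e for e in range(-7, -1)]

slope = loglog_slope(concs, [adsorbed(c) for c in concs])

# For comparison: uptake strictly proportional to concentration has slope 1.
proportional_slope = loglog_slope(concs, [2.0 * c for c in concs])

print(slope, proportional_slope)
```

Because the strong sites saturate first while weaker ones are still filling, the fitted slope falls between 0 and 1, whereas strictly proportional uptake gives a slope of 1.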

The estimates L1-HB9)1 or L1-KS2.1)1 have standard errors less than 9 percent larger than the least squares estimate if the errors have a Gaussian distribution, and have much smaller standard errors than least squares for the more typical longer-tailed distributions. For very large data sets where the L1 estimate may be prohibitively expensive, the estimates LS-KB9)6 or LS-KS2.1)6 might be useful alternatives. While the properties of these fitting techniques have not yet been explored for situations of any substantial complexity, their performance is, so far, very encouraging. It would thus seem sensible to try them in more complex situations and see how they perform. [Pg.43]

Wellcome Trust Case Control Consortium. Genome-Wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls. Nature 447, 661-678 (2007). [A comparison of genetic markers and disease risks, based on a very large data set.]... [Pg.744]
