BioConsert | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Presentation Ranking biological data is a crucial need, in particular in the domain of the Life Sciences where thousands of answers can be obtained for a given query. Ranking biological data is also particularly challenging: bio data form an intricate network of links; bio data reflect expertise and may be associated to several levels of confidence... As a result, several ranking methods have been proposed in the last years, but none of them is currently considered as the most promissing. In the BioConsert project (generating Biological Consensus ranking with ties), our aim is to make the most of the data obtained by several ranking methods by generating a consensus ranking able to highlight the common points of a set of rankings while minimizing their disagreements. Our approach is based on the concept of median, originally defined on permutations: Given m permutations and a distance function, the median problem is to find a permutation that is the closest of the m given permutations. We have investigated the NP-hard problem of computing a median of a set of m rankings considering different elements and ties, under a generalized Kendall-tau distance. We present a new heuristic for the problem and we demonstrate the benefit of our approach on real queries using four different ranking methods. We make available all our scripts (implemented in Maple) and data sets below. Publication (submitted) Sarah Cohen-Boulakia, Alain Denise, and Sylvie Hamel Using Medians to generate consensus rankings for biological data Collaborative work: Université Paris-Sud, France & Université de Montréal, Canada Computing the exact median We have implemented the brute force algorithm to compute the exact median. The script is provided here (to be copy pasted in a Maple setting). The BioConsert heuristic The script implementing the BioConsert heuristic is available here (to be copy pasted in a Maple setting). Implementation of approximation approaches We have implemented the Fagin's approach, the script is available here (to be copy pasted in a Maple setting). We have also implemented Ailon's algorithms, the scripts are available here : 2-ailon and 3/2-ailon (to be copy pasted in a Maple setting). Ranking methods considered Four ranking methods have currently been considered PageRank (PR), which has been used for the first time in the context of biological data in the Biozon project, computes for each node a score representing its authority (popularity), based on the probability that a random walk in the graph stops on the target data after an infinite number of steps. InEdge (IE) and PathCount (PC) which base their ordering on the number of incoming edges pointing to each target data object and the number of existing paths to reach each target data object, respectively. Bioggle [N.Laignel Master's thesis], a ranking method introduced in the BioGuide project which takes into account the confidence users may have on the databases providing data objects. Data sets We have tested our approach on seven biological queries of the form What are the genes known to be possibly associated with the disease X where X can take seven alternative values: Breast cancer, Prostate cancer, Bladder cancer, Retinoblastoma, Neuroblastoma, ADHD (Attention Deficit Hyperactivity Disorder), and Long QT syndrome. For each disease, we provide two files: the mapping between the gene names and arbitrary ids we have given, and all the results we have obtained on our experiments. Long QT syndrome The mapping between the numbers appearing in the files and gene names Results of the ranking methods and consensus rankings obtained. ADHD (Attention Deficit Hyperactivity Disorder) The mapping between the numbers appearing in the files and gene names. Results of the ranking methods and consensus rankings obtained. Neuroblastoma The mapping between the numbers appearing in the files and gene names. Results of the ranking methods and consensus rankings obtained. Retinoblastoma The mapping between the numbers appearing in the files and gene names. Results of the ranking methods and consensus rankings obtained. Breast cancer The mapping between the numbers appearing in the files and gene names. Results of the ranking methods and consensus rankings obtained. Prostate cancer The mapping between the numbers appearing in the files and gene names. Results of the ranking methods and consensus rankings obtained. Bladder cancer The mapping between the numbers appearing in the files and gene names. Results of the ranking methods and consensus rankings obtained. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
bioguide-project.net |
Laboratoire de Recherche en Informatique, University of Pennsylvania |