List of publications

2023 Journal article restricted access

Network homophily via tail inequalities

Homophily is the principle whereby "similarity breeds connections." We give a quantitative formulation of this principle within networks. Given a network and a labeled partition of its vertices, the vector indexed by the classes of the partition, whose entries are the numbers of edges of the subgraphs induced by the corresponding classes, is viewed as the observed outcome of the random vector obtained by picking labeled partitions at random among those whose classes have the same cardinalities as the given one. This is the recently introduced random coloring model for network homophily. In this perspective, the value of any homophily score, namely a nondecreasing real-valued function of the sizes of the subgraphs induced by the classes of the partition, evaluated at the observed outcome, can be thought of as the observed value of a random variable. Consequently, according to a given score, the input network is homophilic at a given significance level whenever the one-sided tail probability of observing a score value at least as extreme as the observed one is smaller than that level. Since, as we show, even approximating this tail probability is an NP-hard problem, we resort to classical tail inequalities to bound it from above. These upper bounds, obtained by specializing the score, yield a class of quantifiers of network homophily. Computing the upper bounds requires knowledge of the covariance matrix of the random vector, which was not previously known within the random coloring model. In this paper we close this gap. Interestingly, the matrix depends on the input partition only through the cardinalities of its classes and depends on the network only through its degrees. Furthermore, all the covariances have the same sign, and this sign is a graph invariant. Plugging this structure into the bounds yields a meaningful, easy-to-compute class of indices for measuring network homophily. As demonstrated in real-world network applications, these indices are effective and reliable, and may lead to discoveries that cannot be captured by the current state of the art.

network homophily; Mahalanobis norm; tail inequalities; graph partitioning; graph invariant; over-dispersed degree distributions
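
The random coloring model described in this abstract can be illustrated with a naive sampling estimate of the one-sided tail probability. This is only a sketch under stated assumptions, not the tail-inequality bounds the paper derives: the score used here (total number of intra-class edges), the function names, and the toy graph are all hypothetical choices made for illustration.

```python
import random

def intra_class_edges(edges, labels):
    """Count edges whose two endpoints carry the same class label."""
    return sum(1 for u, v in edges if labels[u] == labels[v])

def empirical_tail_probability(edges, labels, samples=2000, seed=0):
    """Estimate P(score >= observed) under relabelings that keep class sizes."""
    rng = random.Random(seed)
    nodes = list(labels)
    colors = [labels[n] for n in nodes]
    observed = intra_class_edges(edges, labels)
    hits = 0
    for _ in range(samples):
        rng.shuffle(colors)  # shuffling preserves the cardinality of every class
        shuffled = dict(zip(nodes, colors))
        if intra_class_edges(edges, shuffled) >= observed:
            hits += 1
    return hits / samples

# Toy input (assumed): two disjoint triangles, each forming one class.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
p_hat = empirical_tail_probability(edges, labels)
```

On this toy input only 2 of the 20 equally likely class assignments reach the observed score, so the estimate should hover around 0.1. Sampling of this kind becomes unreliable for very small tail probabilities, which is one reason closed-form upper bounds are attractive.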
2022 Journal article open access

Evaluating homophily in networks via HONTO (HOmophily Network TOol): a case study of chromosomal interactions in human PPI networks

Nicola Apollonio ; Daniel Blankenberg ; Fabio Cumbo ; Paolo Giulio Franciosa ; Daniele Santoni

A typical behavior inspired by the general principle 'similarity breeds connections' has been observed in different kinds of networks, such as social or biological ones. These networks are called homophilic, as nodes belonging to the same class preferentially interact with each other. In this work, we present HONTO (HOmophily Network TOol), a user-friendly open-source Python3 package designed to evaluate and analyze homophily in complex networks. The tool takes as input the network along with a partition of its nodes into classes and yields a matrix whose entries are the homophily/heterophily z-score values. To complement the analysis, the tool also provides z-score values for the number of nodes that do not interact with any other node of the same class. Homophily/heterophily z-score values are presented as a heatmap, allowing a visual at-a-glance interpretation of results.

Homophily; Networks
2022 Journal article open access

A novel method for assessing and measuring homophily in networks through second-order statistics

Apollonio N ; Franciosa PG ; Santoni D

We present a new method for assessing and measuring homophily in networks whose nodes have categorical attributes, namely when the nodes of networks come partitioned into classes (colors). We probe this method in two different classes of networks: (i) protein-protein interaction (PPI) networks, where nodes correspond to proteins, partitioned according to their functional role, and edges represent functional interactions between proteins; (ii) the Pokec online social network, where nodes correspond to users, partitioned according to their age, and edges represent friendship between users. Similarly to other classical and well-consolidated approaches, our method compares the relative edge density of the subgraphs induced by each class with the corresponding expected relative edge density under a null model. The novelty of our approach consists in prescribing an endogenous null model, namely, the sample space of the null model is built on the input network itself. This allows us to give exact explicit expressions for the z-score of the relative edge density of each class as well as other related statistics. The z-scores directly quantify the statistical significance of the observed homophily via the Čebyšëv inequality. The network structure enters the expression of each z-score through basic combinatorial invariants such as the number of subgraphs with two spanning edges. Each z-score is computed in O(n + m) time for a network with n nodes and m edges. This leads to an overall efficient computational method for assessing homophily. We complement the analysis of homophily/heterophily by considering z-scores of the number of isolated nodes in the subgraphs induced by each class, which are computed in O(nm) time. Theoretical results are then exploited to show that, as expected, both the analyzed network classes are significantly homophilic with respect to the considered node properties.

computational models; statistical methods; protein function predictions
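
The link between a z-score and statistical significance via the Čebyšëv (Chebyshev) inequality, mentioned in this abstract, can be sketched as follows. The function names are hypothetical, and the sketch uses only the textbook two-sided inequality P(|X - mean| >= z * std) <= 1/z^2. It makes no claim about how the paper computes the mean and standard deviation under its null model.

```python
def z_score(observed, mean, std):
    """Standardize an observed statistic against its null mean and std."""
    return (observed - mean) / std

def chebyshev_tail_bound(z):
    """Upper bound on P(|X - mean| >= |z| * std) from Chebyshev's inequality."""
    z = abs(z)
    if z == 0:
        return 1.0  # the inequality is uninformative at z = 0
    return min(1.0, 1.0 / (z * z))
```

For instance, a z-score of 10 guarantees that a deviation at least that large has probability at most 0.01 under the null model, whatever the shape of the null distribution.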
2022 Journal article open access

Evaluation of HIV-1 integrase variability by combining computational and probabilistic approaches

Davide Vergni ; Daniele Santoni ; Yagai Bouba ; Saverio Lemme ; Lavinia Fabeni ; Luca Carioti ; Ada Bertoli ; William Gennari ; Federica Forbici ; Carlo Federico Perno ; Roberta Gagliardin ; Francesca Ceccherini-Silberstein ; Maria Mercedes Santoro ; on behalf of the HIV drug-resistance group

This study aimed to update previous data on HIV-1 integrase variability using effective bioinformatics methods that combine different statistical instruments, from simple entropy and mutation rate to more specific approaches such as Hellinger distance. A total of 2133 HIV-1 integrase sequences were analyzed: i) 1460 samples from drug-naïve [DN] individuals; ii) 386 samples from drug-experienced but INI-naïve [IN] individuals; iii) 287 samples from INI-experienced [IE] individuals. Within the three groups, 76 amino acid positions were highly conserved (≤0.2% variation, Hellinger distance: <0.25%), with 35 fully invariant positions, while 80 positions were conserved (>0.2% to <1% variation, Hellinger distance: <1%). The H12-H16-C40-C43 and D64-D116-E152 motifs were all well conserved. Some residues were affected by dramatic changes in their mutation distributions, especially between DN and IE samples (Hellinger distance ≥1%). In particular, 15 positions (D6, S24, V31, S39, L74, A91, S119, T122, T124, T125, V126, K160, N222, S230, C280) showed a significant decrease in mutation rate in IN and/or IE samples compared to DN samples. Conversely, 8 positions showed a significantly higher mutation rate in samples from treated individuals (IN and/or IE) compared to DN. Some of these positions, such as E92, T97, G140, Y143, Q148 and N155, were already known to be associated with resistance to integrase inhibitors; other positions, including S24, M154, V165 and D270, are not yet documented to be associated with resistance. Our study confirms the high conservation of HIV-1 integrase and identifies highly invariant positions using robust and innovative methods. The role of novel mutations located in the critical region of HIV-1 integrase deserves further investigation.
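
For readers unfamiliar with the Hellinger distance used in this abstract, a minimal sketch of the standard definition for two discrete distributions is given below. The function name and the dictionary representation of per-position amino acid frequencies are assumptions made for illustration. The study's own pipeline, including how it scales the distance to the percentages quoted above, is not specified here.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions given as dicts.

    Keys are outcomes (e.g. amino acids at one position), values are
    probabilities summing to 1. The result lies in [0, 1].
    """
    outcomes = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
        for k in outcomes
    )
    return math.sqrt(total / 2.0)
```

Identical distributions give 0, and distributions with no outcome in common give 1, which makes the measure convenient for comparing mutation distributions between patient groups.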

2021 Journal article open access

A genome-wide study on differential methylation in different cancers using TCGA database

Santoni D ; Pignotti D ; Vergni D

Background: DNA methylation is the main epigenetic mechanism driving changes in phenotype without altering genotype. Since the end of the seventies the role of methylation in cancer has become increasingly clear. Objective: The aim of this work is to shed light on the impact of methylation events on cancer cells, providing evidence that differential methylation in small regions, mostly characterized by hypermethylation, affects gene regulation, while differential methylation in large genomic regions, mostly characterized by hypomethylation, affects chromosomal organization. Methods: By means of a solid computational and statistical analysis, methylation maps of cancer and normal samples in six different cancer types were studied, looking for genomic regions showing differentially methylated patterns between the two conditions. Results: Through a chromosome sliding-window approach, a set of differentially methylated genomic micro regions of size 2 kbp and macro regions of size 1 Mbp was identified. Micro regions are mostly linked to functional elements, while macro regions are mostly linked to nuclear chromosome organization. Results discussed in previous works were confirmed, providing clear evidence that hypermethylation mainly occurs in significant micro regions while hypomethylation mainly occurs in significant macro regions. Interestingly, differentially methylated regions common to the six different cancers were identified, and some unexpected and previously unexplored peculiar methylation patterns were also found. Conclusions: The effective and robust computational and statistical methodology presented in this work can be used to shed light on the role that DNA methylation plays in cancer and in other non-malignant diseases, and can be customized to study differentially methylated patterns in specific areas of interest of the genome at both small and large scales.

Cancer; Methylation maps; The Cancer Genome Atlas; Gene regulation; Chromosomal structure; Lamina-associated domains
2021 Technical report metadata only access

On function homophily of microbial Protein-Protein Interaction Networks.

Nicola Apollonio ; Paolo Giulio Franciosa ; Daniele Santoni

We present a new method for assessing homophily in networks whose vertices have categorical attributes, namely when the vertices of networks come partitioned into classes. We apply this method to Protein-Protein Interaction networks, where vertices correspond to proteins, partitioned according to their functional role, and edges represent potential interactions between proteins. Similarly to other classical and well-consolidated approaches, our method compares the relative edge density of the subgraphs induced by each class with the corresponding expected relative edge density under a null model. The novelty of our approach consists in prescribing an endogenous null model, namely, the sample space of the null model is built on the input network itself. This allows us to give exact explicit expressions for the z-score of the relative edge density of each class as well as other related statistics. The z-scores directly quantify the statistical significance of the observed homophily via the Čebyšëv inequality. The network structure enters the expression of each z-score through basic combinatorial invariants such as the number of subgraphs with two spanning edges. Each z-score is computed in O(n^3) worst-case time for a network with n vertices. This leads to an overall effective computational method for assessing homophily. Theoretical results are then exploited to prove that Protein-Protein Interaction networks are significantly homophilous.

Protein-Protein Interaction Networks; Protein function; Homophily
2020 Journal article restricted access

In the search of potential epitopes for Wuhan seafood market pneumonia virus using high order nullomers

Alarms periodically emerge for viral pneumonia infections due to coronavirus. In all cases, these are zoonoses that pass the barrier between species and infect humans. The legitimate concern of the international community is due to the fact that the newly identified coronavirus, named SARS-CoV-2 (previously called 2019-nCoV), has a quite high mortality rate, around 2%, and a strong ability to spread, with an estimated reproduction number higher than 2. Even though all countries are doing their utmost to stop the pandemic, the only reliable solution to tackle the infection is the rapid development of a vaccine. For this purpose, bioinformatics, applied in the context of the reverse-vaccinology paradigm, can be of fundamental help in selecting the most promising peptides able to trigger an effective immune response. In this short report, using the concept of nullomer and introducing a distance from human self, we provide a list of peptides that could deserve experimental investigation in view of a potential vaccine for SARS-CoV-2.

Nullomers, Peptide-HLA, Immunoinformatics, Viral genomes, SARS-CoV-2, Self/Non-Self
2020 Journal article open access

The farther the better: investigating how distance from human self affects the propensity of a peptide to be presented on cell surface by MHC class I molecules, the case of Trypanosoma cruzi.

More than twenty years ago the reverse vaccinology paradigm came to light, trying to design new vaccines based on the analysis of genomic information in order to select those pathogen peptides able to trigger an immune response. In this context, focusing on the proteome of Trypanosoma cruzi, we investigated the link between the probabilities for pathogen peptides to be presented on a cell surface and their distance from human self. We found a reasonable but, as far as we know, undiscovered property: the farther the distance between a peptide and the human self, the higher the probability for that peptide to be presented on a cell surface. We also found that the most distant peptides from human self bind, on average, a broader collection of HLAs than expected, implying a potential immunological role in a large portion of individuals. Finally, introducing a novel quantitative indicator for a peptide to measure its potential immunological role, we proposed a pool of peptides that could be potential epitopes and that can be suitable for experimental testing. The software to compute peptide classes according to the distance from human self is freely available at http://www.iasi.cnr.it/~dsantoni/nullomers.

Process-Antigen Presentation/Processing; Molecules-MHC; Self/Non-Self; Epitopes; Nullomers; Reverse vaccinology.
2019 Other metadata only access

Winners of StartCup Lazio

The business idea behind the start-up ProNeuro stems from the research work carried out by its founding partners at the Consiglio Nazionale delle Ricerche (CNR). Over the last 3 years this work has led to the filing of two Italian patent applications, one of which has already been extended to PCT. They protect the use of the ProNGF-A molecule for therapeutic purposes aimed at treating neurological and inflammatory diseases (patent application No. 102018000003279 of 05/03/2018 and PCT/IB2019/051753 of 05/03/2019) and the production of a mutated form of ProNGF-A and its use in neurological therapy and in the treatment of skin diseases (patent application No. 102019000014646 of 12/08/2019). These patents are owned by the CNR, while ProNeuro has developed an offering designed to exploit them. Through Research and Development activities, ProNeuro identifies pharmacologically active ingredients with protective and reparative activity for the nervous system, modifies their structure to make them more effective, safe and biocompatible, develops the production methods and carries out the first stages of characterization of their effects, before offering them to pharmaceutical companies for further development as drugs intended for the market. ProNeuro therefore markets the rights to use the intellectual property and a range of products linked to the discovery, production (technology transfer) and initial validation, both predictive and biological, of new neurological drugs. ProNeuro will have the legal form of a limited liability company (Società a responsabilità limitata) and is set up as a CNR spin-off. As such, the relationship between ProNeuro and the CNR is governed by the "Regolamento per la costituzione e la partecipazione del CNR alle Imprese Spin off, Del,18/2019". The above-mentioned patents, currently owned by the CNR, will be licensed to ProNeuro, with the possibility of sub-licensing to third parties, on the basis of that Regulation. The Regulation provides for the licensing of CNR-owned patents on favorable terms, the provision of logistical and instrumental resources during the start-up phase, and the authorization for CNR staff to carry out activities for spin-offs, with salary costs covered for one third of working time for three years. The company's headquarters have been identified at the Istituto di Farmacologia Traslazionale of the CNR, via del Fosso del Cavaliere 100, 00133 Roma.

ProNeuro; NGF; proNGF
2019 Poster in conference proceedings metadata only access

A Machine Learning Approach for Disease Genes Signatures

Annalisa Longo ; Venkata Pochiraju ; Daniele Santoni ; Davide Vergni ; Paolo Tieri

In the context of network medicine, disease genes, i.e. genes that have been experimentally associated with the onset or progression of a pathology, show a complex set of features that are not easily reduced to, or grasped by, a simple network approach (e.g., studying centrality measures or clustering characteristics of the gene network). Here, to overcome such limitations and to exploit a larger set of available informational attributes, we analyze a sizeable integrated set of biological, ontological and topological features (including interaction data and GO categories, among others) related to different collections of disease genes (including, but not limited to, sets related to several inflammatory and dysmetabolic diseases) via a comprehensive machine learning (ML) approach, in order to discover recurring patterns of attributes associated with families of disease genes. In this way the chances of revealing complex, hidden topological, ontological and statistical properties of the genes under scrutiny are greater, and the derived "signature" can be used heuristically in a discovery process to find further, as yet unknown, disease genes. We show hurdles, discriminating capabilities and main results in sorting out and reconstructing the feature sets, in selecting the appropriate ML approach and in analyzing the datasets.

machine learning; disease genes; network medicine
2018 Journal article metadata only access

Investigating transcription factor synergism in humans.

Proteins are the core and the engine of every process in cells; thus, the study of the mechanisms that drive the regulation of protein expression is essential. Transcription factors play a central role in this extremely complex task, and they cooperate synergistically to fine-tune protein expression. In the present study, we designed a mathematically well-founded procedure to investigate the mutual positioning of the binding sites of a given pair of transcription factors in order to evaluate the possible association between them. We obtained a list of highly related transcription factor pairs whose binding site occurrences significantly group together for a given set of gene promoters, identifying the biological contexts in which the pairs are involved and the processes they should contribute to regulating. Study of synergies between transcription factors in promoters.

transcription factors; gene regulation; biological process; computational biology
2016 Journal article metadata only access

Nullomers and high order nullomers in genomic sequences

A nullomer is an oligomer that does not occur as a subsequence in a given DNA sequence, i.e. it is an absent word of that sequence. The importance of nullomers in several applications, from drug discovery to forensic practice, is now debated in the literature. Here, we investigated the nature of nullomers, asking whether their absence in genomes has just a statistical explanation or is a peculiar feature of genomic sequences. We introduced an extension of the notion of nullomer, namely high order nullomers, which are nullomers whose mutated sequences are still nullomers. We studied different aspects of them: comparison with nullomers of random sequences, CpG distribution and mean helical rise. In agreement with previous results, we found that the number of nullomers in the human genome is much larger than expected by chance. Nevertheless, antithetical results were found when considering a random DNA sequence preserving dinucleotide frequencies. The analysis of CpG frequencies in nullomers and high order nullomers revealed, as expected, a high CpG content, but it also highlighted a strong dependence of CpG frequencies on the dinucleotide position, suggesting that nullomers have their own peculiar structure and are not simply sequences whose CpG frequency is biased. Furthermore, phylogenetic trees were built on eleven species based on both the similarities between the dinucleotide frequencies and the number of nullomers two species share, showing that nullomers are fairly conserved among close species. Finally, the study of the mean helical rise of nullomer sequences revealed significantly high mean rise values, reinforcing the hypothesis that those sequences have some peculiar structural features. The obtained results show that nullomers are the consequence of the peculiar structure of DNA (also including biased CpG frequency and CpG islands), so that the hypermutability model, even when taking CpG islands into account, does not seem sufficient to explain the nullomer phenomenon. Finally, high order nullomers could emphasize those features that already make simple nullomers useful in several applications.

DNA sequence; Absent word
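
The definition of a nullomer as an absent word can be made concrete with a brute-force sketch. It assumes that "subsequence" in the abstract means a contiguous substring, considers a single strand only, and uses a hypothetical function name. It is meant for short toy sequences and is not the method used in the paper.

```python
from itertools import product

def nullomers(sequence, k, alphabet="ACGT"):
    """Return all length-k words over the alphabet that never occur in sequence."""
    # Collect every contiguous k-mer that is present in the sequence.
    present = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    # Enumerate all possible k-mers and keep the absent ones.
    candidates = ("".join(word) for word in product(alphabet, repeat=k))
    return sorted(w for w in candidates if w not in present)
```

For example, `nullomers("AACC", 2)` returns the 13 dinucleotides other than AA, AC and CC. Enumerating all 4^k words is only feasible for small k, so genome-scale studies need more economical data structures.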
2016 Journal article metadata only access

Natural vs. random protein sequences: Discovering combinatorics properties on amino acid words

In order to study the possibility of discriminating between random and natural amino acid sequences, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones. © 2015 Elsevier Ltd. All rights reserved. Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe at present in nature. The question of which is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences while, in order to have the correct functionality, one expects that selection mechanisms impose rigid constraints on amino acid sequences. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that it could be reasonable for a well-tuned amino acid sequence to be indistinguishable from a random one.

Protein sequence; Random sequence; Combinatorics of words; Amino acid association
2015 Journal article metadata only access

Natural vs. Random Protein Sequences: Discovering Combinatorics Properties on Amino Acid Words

Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe at present in nature. The question of which is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences while, in order to have the correct functionality, one expects that selection mechanisms impose rigid constraints on amino acid sequences. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that it could be reasonable for a well-tuned amino acid sequence to be indistinguishable from a random one. In order to study the possibility of discriminating between random and natural amino acid sequences, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1,047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones.

Protein sequence; Random sequence; Combinatorics of words; Amino acid association
2014 Journal article metadata only access

Multi-scale Simulation of T Helper Lymphocyte Differentiation

The complex differentiation process of the CD4+ T helper lymphocytes shapes the form and the range of the immune response to different antigenic challenges. Few mathematical and computational models have addressed this key phenomenon. We here present a multiscale approach in which two different levels of description, i.e. a gene regulatory network model and an agent-based simulator for cell population dynamics, are integrated into a single immune system model. We illustrate how such model integration allows bridging a gap between gene level information and cell level population, and how the model is able to describe a coherent immunological behaviour when challenged with different stimuli.

CD4+ T cell differentiation; CD4+ T cell dogma; Computational immunology; Gene regulatory networks; Immunoinformatics; T helper lymphocyte
2013 Journal article metadata only access

Identifying Correlations between Chromosomal Proximity of Genes and Distance of Their Products in Protein-Protein Interaction Networks of Yeast

In this article we present evidence for a relationship between chromosome gene loci and the topological properties of the protein-protein interaction network corresponding to the set of genes under consideration. Specifically, for each chromosome of the Saccharomyces cerevisiae genome, the distribution of the intra-chromosome inter-gene distances was analyzed and a positive correlation with the distance among the corresponding proteins of the protein-protein interaction network was found. In order to study this relationship we used concepts based on non-parametric statistics and information theory. We provide statistical evidence that if two genes are closely located, then it is likely that their protein products are closely located in the protein-protein interaction network, or in other words, that they are involved in the same biological process.

2012 Journal article metadata only access

Characterizing protein shape by a volume distribution asymmetry index

Arrigo Nicola ; Paci Paola ; Di Paola Luisa ; Santoni Daniele ; de Ruvo Micol ; Giuliani Alessandro ; Castiglione Filippo

A fully quantitative shape index relying upon the asymmetry of the mass distribution of protein molecules along the three space dimensions is proposed. Multidimensional statistical analysis, based on principal component extraction and subsequent linear discriminant analysis, showed the presence of three major 'attractor forms' roughly corresponding to rod-like, discoidal and spherical shapes. This classification of protein shapes was in turn demonstrated to be strictly connected with topological features of proteins, as emerging from complex network invariants of their contact maps. © Arrigo et al.

Principal component analysis; Protein contact network; Protein shape; Topological indices
2011 Journal article metadata only access

Immunological network signatures of cancer progression and survival

Clancy T ; Pedicini M ; Castiglione F ; Santoni D ; Nygaard V ; Lavelle TJ ; Benson M ; Hovig E
2011 Journal article metadata only access

Immunological network signatures of cancer progression and survival

Trevor Clancy ; Marco Pedicini ; Filippo Castiglione ; Daniele Santoni ; Vegard Nygaard ; Timothy J Lavelle ; Mikael Benson ; Eivind Hovig
2011 Journal article metadata only access

CTLs' repertoire shaping in the thymus: A Monte Carlo simulation

Motivation: The human immune system evolved a multi-layered control mechanism to eliminate self-reactive cells. Of these so-called tolerance induction mechanisms, T lymphocyte education in the thymus gland represents the very first one. This complicated process is not fully understood, and quantitative models able to help in this endeavor are lacking. Here, we present a stochastic computational model of the thymus which combines data-driven prediction methods and a novel method based on protein-protein potential measurements for assessing molecular binding among cell receptors, major histocompatibility complex (MHC) molecules, and self-peptides. Results: Of all possible specificities of immature T cells entering the thymus, only a small fraction is actually selected for maturation. Monte Carlo simulations of thymocyte selection in the thymus are performed varying the size of the self and a parameter determining the number of encounters with antigen-presenting cells (APCs). We score the fraction of self-reacting thymocytes leaving the thymus as mature naive T cells and show that self-reactivity is only marginally dependent on the number of self-molecules presented by APCs, while it is strongly affected by a parameter proportional to the time spent in the thymus. We study how this measure changes when we vary the number of MHC alleles and find an optimal number not too different from what we have in reality. The main result of this study is more methodological than biological, as we show that immunoinformatics data and methods can be used in systemic-level simulation of immune processes. © 2011 Informa UK, Ltd.

computational biology immunoinformatics Monte Carlo simulation repertoire thymus selection