In recent years, the number of complex documents and texts has grown exponentially, requiring a deeper understanding of machine learning methods to classify text accurately in many applications. Many machine learning approaches have achieved strong results in natural language processing. The success of these learning algorithms relies on their capacity to model complex, non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification remains a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extraction methods, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their applications to real-world problems are discussed.
Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text Classification Algorithms: A Survey. Information 2019, 10, 150.
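The feature-extraction-then-classification pipeline the survey covers can be illustrated with a minimal sketch in pure Python: TF-IDF weighting followed by cosine similarity between documents. The tokenized toy documents are hypothetical, and a real pipeline would add dimensionality reduction and a trained classifier on top:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors (as sparse dicts) for tokenized docs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In practice, library implementations such as scikit-learn's TfidfVectorizer would replace this hand-rolled version.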
Celiac Disease (CD) and Environmental Enteropathy (EE) are common causes of malnutrition and adversely impact normal childhood development. CD is an autoimmune disorder that is prevalent worldwide and is caused by an increased sensitivity to gluten. Gluten exposure damages the small intestinal epithelial barrier, resulting in nutrient malabsorption and childhood under-nutrition. EE also results in barrier dysfunction but is thought to be caused by an increased vulnerability to infections. EE has been implicated as the predominant cause of under-nutrition, oral vaccine failure, and impaired cognitive development in low- and middle-income countries. Both conditions require a tissue biopsy for diagnosis, and a major challenge in interpreting clinical biopsy images to differentiate between these gastrointestinal diseases is the striking histopathologic overlap between them. In the current study, we propose a convolutional neural network (CNN) to classify duodenal biopsy images from subjects with CD, EE, and healthy controls. We evaluated the performance of our proposed model using a large cohort containing 1000 biopsy images. Our evaluations show that the proposed model achieves an area under the ROC curve of 0.99, 1.00, and 0.97 for CD, EE, and healthy controls, respectively. These results demonstrate the discriminative power of the proposed model in classifying duodenal biopsies.
Kowsari, K., Sali, R., Khan, M. N., Adorno, W., Ali, S. A., Moore, S. R., ... & Brown, D. E. (2019). Diagnosis of Celiac Disease and Environmental Enteropathy on Biopsy Images Using Color Balancing on Convolutional Neural Networks. (Accepted FTC 2019)
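The paper's title mentions color balancing as a preprocessing step. One common balancing scheme, shown purely as an illustrative sketch (the paper's exact method may differ), is the gray-world assumption, which rescales each RGB channel so its mean matches the overall mean intensity:

```python
def gray_world_balance(pixels):
    """Gray-world color balancing: scale each RGB channel so that its
    mean matches the image's overall mean intensity.
    `pixels` is a list of (R, G, B) tuples; channels are assumed to
    have nonzero means."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    return [tuple(min(255.0, p[c] * gray / means[c]) for c in range(3))
            for p in pixels]
```

After balancing, each channel's mean equals the gray level (unless clipping at 255 intervenes), which removes a global color cast before the images reach the CNN.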
The wide implementation of electronic health record (EHR) systems facilitates the collection of large-scale health data from real clinical settings. Despite the significant increase in adoption of EHR systems, these data remain largely unexplored, yet they present a rich source for knowledge discovery from patient health histories in tasks such as understanding disease correlations and predicting health outcomes. However, the heterogeneity, sparsity, noise, and bias in these data present many complex challenges. This complexity makes it difficult to translate potentially relevant information into machine learning algorithms. In this paper, we propose a computational framework, Patient2Vec, to learn an interpretable deep representation of longitudinal EHR data which is personalized for each patient. To evaluate this approach, we apply it to the prediction of future hospitalizations using real EHR data and compare its predictive performance with baseline methods. Patient2Vec produces a vector space with meaningful structure, achieving an AUC of approximately 0.799 and outperforming baseline methods. Finally, the learned feature importance can be visualized and interpreted at both the individual and population levels to provide clinical insights.
Zhang, Jinghe, Kamran Kowsari, James H. Harrison, Jennifer M. Lobo, and Laura E. Barnes. "Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record." IEEE ACCESS 2018
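As a loose, generic sketch (not the actual Patient2Vec architecture, which is gated and recurrent), the core idea of aggregating per-visit vectors into one interpretable patient vector with softmax attention weights can be written as:

```python
import math

def attention_pool(visit_vectors, scores):
    """Aggregate per-visit embeddings into a single patient vector using
    softmax attention weights; the returned weights indicate how much
    each visit contributed, which is what makes the pooling interpretable.
    (Generic attention-pooling sketch, not Patient2Vec itself.)"""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]         # softmax over visit scores
    dim = len(visit_vectors[0])
    patient = [sum(w * v[d] for w, v in zip(weights, visit_vectors))
               for d in range(dim)]
    return patient, weights
```

Inspecting `weights` for an individual shows which visits drove the prediction, mirroring the individual-level interpretability described in the abstract.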
Automatic understanding of domain-specific texts in order to extract useful relationships for later use is a non-trivial task. One such relationship is that between railroad accidents' causes and their corresponding descriptions in reports. From 2001 to 2016, rail accidents in the U.S. cost more than $4.6B. Railroads involved in accidents are required to submit an accident report to the Federal Railroad Administration (FRA).
These reports contain a variety of fixed field entries, including the primary cause of the accident (a coded variable with 389 values), as well as a narrative field, which is a short text description of the accident.
Although these narratives provide more information than a fixed field entry, the terminology used in these reports is not easy for a non-expert reader to understand. Therefore, providing an assistive method to fill in the primary cause from such domain-specific texts (narratives) would help label the accidents more accurately. Another important question for transportation safety is whether the reported accident cause is consistent with the narrative description. To address these questions, we applied deep learning methods together with powerful word embeddings such as Word2Vec and GloVe to classify accident cause values for the primary cause field using the text in the narratives. The results show that such approaches can both accurately classify accident causes based on report narratives and find important inconsistencies in accident reporting.
Mojtaba Heidarysafa, Kamran Kowsari, Laura E. Barnes, and Donald E. Brown. "Analysis of Railway Accidents' Narratives Using Deep Learning." 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018
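A common way to use pretrained Word2Vec or GloVe vectors for narrative classification, sketched here with tiny hypothetical embeddings, is to average the word vectors of a narrative into a single document vector before feeding it to a classifier:

```python
def doc_vector(tokens, embeddings):
    """Average pretrained word embeddings (Word2Vec/GloVe style) over a
    narrative's tokens; out-of-vocabulary words are skipped.
    `embeddings` maps token -> list of floats."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None                         # no known words in narrative
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
```

The resulting fixed-length vector can then be passed to any standard classifier; deep models like those in the paper instead consume the embedded token sequence directly.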
The continually increasing number of complex datasets each year necessitates ever-improving machine learning methods for robust and accurate categorization of these data. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble deep learning approach for classification. Deep learning models have achieved state-of-the-art results across many domains. RMDL solves the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. RMDL can accept a variety of data as input, including text, video, image, and symbolic data. This paper describes RMDL and shows test results for image and text data including MNIST, CIFAR-10, WOS, Reuters, IMDB, and 20 Newsgroups. These test results show that RMDL produces consistently better performance than standard methods over a broad range of data types and classification problems.
Kamran Kowsari, Mojtaba Heidarysafa, Donald E. Brown, Kiana Jafari Meimandi, and Laura E. Barnes. 2018. RMDL: Random Multimodel Deep Learning for Classification. In ICISDM ’18: 2018 2nd International Conference on Information System and Data Mining, April 9–11, 2018, Lakeland, FL, USA. ACM, New York, NY, USA
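The ensemble step of an RMDL-style approach can be sketched as a simple majority vote over the per-sample predictions of independently trained models (the label arrays below are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine predictions from an ensemble of independently trained
    models by majority vote. `predictions` is a list of per-model
    label lists, all over the same samples; ties go to the label that
    appears first among the votes."""
    results = []
    for votes in zip(*predictions):         # one vote tuple per sample
        results.append(Counter(votes).most_common(1)[0][0])
    return results
```

Voting over structurally diverse models is what lets the ensemble sidestep the search for a single best architecture while improving robustness.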
Suicide is the second leading cause of death among young adults, but the challenges of preventing suicide are significant because the signs often seem invisible. Research has shown that clinicians are not able to reliably predict when someone is at greatest risk. In this paper, we describe the design, collection, and analysis of text messages from individuals with a history of suicidal thoughts and behaviors to build a model to identify periods of suicidality (i.e., suicidal ideation and non-fatal suicide attempts). By reconstructing the timeline of recent suicidal behaviors through a retrospective clinical interview, this study utilizes a prospective research design to understand if text communications can predict periods of suicidality versus depression. Identifying subtle clues in communication indicating when someone is at heightened risk of a suicide attempt may allow for more effective prevention of suicide.
Alicia L. Nobles, Jeffrey J. Glenn, Kamran Kowsari, Bethany A. Teachman, Laura E. Barnes "Identification of Imminent Suicide Risk Among Young Adults using Text Messages" Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems
Social anxiety disorder affects approximately 7% of the adult population in the U.S., yet the vast majority of these individuals do not seek treatment. Thus, it is critical to examine models that deliver treatment to them. Computerized Cognitive Bias Modification (CBM) training programs can be effective in targeting interpretation bias, a key cognitive mechanism underlying social anxiety, and have potential for widespread dissemination, especially if they can be delivered via smartphones, which are becoming ubiquitous. However, the efficacy of CBM interpretation training paradigms adapted to and delivered via smartphones remains unknown. We present a pilot study investigating whether physiological data can be used to track changes over a smartphone-based CBM intervention for social anxiety. In a 3-week open-trial pilot study involving 20 highly socially anxious participants, self-report affect ratings, heart rate, and accelerometer data were collected using a smartphone and smartwatch before, after, and during the CBM intervention. The study focused on the relationship between accelerometer and heart rate data to track change following the intervention. Results provide preliminary evidence for the viability of using physiological data to identify the change in mental state influenced by CBM interventions.
Mehdi Boukhechba, Jiaqi Gong, Kamran Kowsari, Mawulolo Ameko, Karl Fua, Philip Chow, Yu Huang, Bethany Teachman, Laura Barnes. "Physiological Changes over the Course of Cognitive Bias Modification for Social Anxiety." IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), 2018
This paper introduces a novel real-time Fuzzy Supervised Learning with Binary Meta-Feature (FSL-BM) algorithm for big data classification tasks. The study of real-time algorithms addresses several major concerns, namely accuracy, memory consumption, time complexity, and the ability to relax assumptions. Attaining a fast computational model that provides fuzzy logic and supervised learning is one of the main challenges in machine learning. In this research paper, we present the FSL-BM algorithm as an efficient solution for supervised learning with fuzzy logic processing, using a binary meta-feature representation together with Hamming distance and hash functions to relax assumptions. While many studies over the last decade have focused on reducing time complexity and increasing accuracy, the novel contribution of the proposed solution comes through the integration of Hamming distance, hash functions, binary meta-features, and binary classification to provide a real-time supervised method. The Hash Table (HT) component gives fast access to existing indices and therefore allows generation of new indices in constant time, which lets the method supersede existing fuzzy supervised algorithms with better or comparable results. To summarize, the main contribution of this technique for real-time fuzzy supervised learning is to represent hypotheses through binary inputs as a meta-feature space and to create a fuzzy supervised hash table to train and validate the model.
Kowsari, Kamran, Nima Bari, Roman Vichr, and Farhad A. Goodarzi. "FSL-BM: Fuzzy Supervised Learning with Binary Meta-Feature for Classification." FICC 2018.
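The hash-table lookup with Hamming-distance tolerance that the abstract describes can be sketched as follows. This toy version tolerates a single flipped bit by probing all Hamming-distance-1 neighbors of the key, an assumption made for illustration rather than the paper's exact scheme:

```python
def fuzzy_lookup(table, key):
    """Look up a binary-string key in a hash table, tolerating up to one
    flipped bit (Hamming distance <= 1). Exact hits cost one probe;
    fuzzy hits cost at most len(key) additional constant-time probes."""
    if key in table:
        return table[key]
    for i in range(len(key)):               # try each single-bit flip
        flipped = key[:i] + ("1" if key[i] == "0" else "0") + key[i + 1:]
        if flipped in table:
            return table[flipped]
    return None
```

Because every probe is a constant-time hash lookup, the fuzzy match avoids any scan over the stored keys, which is the property the abstract's constant-time-access claim rests on.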
The continually increasing number of documents produced each year necessitates ever-improving information processing methods for searching, retrieving, and organizing text. Central to these information processing methods is document classification, which has become an important application for supervised learning. Recently, the performance of these traditional classifiers has degraded as the number of documents has increased, because growth in the number of documents has been accompanied by an increase in the number of categories. This paper approaches this problem differently from current document classification methods, which view the problem as multi-class classification. Instead, we perform hierarchical classification using an approach we call Hierarchical Deep Learning for Text classification (HDLTex). HDLTex employs stacks of deep learning architectures to provide specialized understanding at each level of the document hierarchy.
K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, “HDLTex: Hierarchical deep learning for text classification,” 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017
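The two-level routing idea behind hierarchical classification can be sketched with stand-in models (plain callables here; HDLTex itself uses stacks of deep networks at each level):

```python
def hierarchical_classify(doc, parent_model, child_models):
    """Two-level hierarchical classification: a parent model picks the
    top-level category, then the category-specific child model picks
    the fine-grained label. Models are stand-in callables taking a
    document and returning a label."""
    parent_label = parent_model(doc)
    child_label = child_models[parent_label](doc)
    return parent_label, child_label
```

The payoff is that each child model only discriminates among the labels within its category, so no single classifier has to separate all fine-grained classes at once.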
String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slowly when we increase the dictionary size (Σ) or allow more mismatches (M). This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched substrings, resulting in a time cost proportional to O(Σ^M). We propose a fast algorithm for calculating the Gapped k-mer Kernel using Counting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger Σ and M, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the Σ^M term that slows down the trie-based approach. Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and outperforms its speed by factors of 2, 100, and 4 on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets), respectively.
Singh, Ritambhara, Arshdeep Sekhon, Kamran Kowsari, Jack Lanchantin, Beilun Wang, and Yanjun Qi. "GaKCo: a Fast GApped k-mer string Kernel using COunting." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2017)
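The counting idea behind GaKCo, using a plain dictionary as the associative array, can be sketched as follows. This toy version enumerates every way of keeping k of the g positions in each window; it illustrates gapped k-mer counting but not GaKCo's cumulative-counting optimizations:

```python
from collections import Counter
from itertools import combinations

def gapped_kmer_counts(seq, g, k):
    """Count gapped k-mers: from every length-g window of `seq`, keep k
    of the g positions and replace the rest with '_' wildcards, using a
    Counter (associative array) keyed by (kept-positions, pattern)."""
    counts = Counter()
    for start in range(len(seq) - g + 1):
        window = seq[start:start + g]
        for kept in combinations(range(g), k):
            key = "".join(window[i] if i in kept else "_" for i in range(g))
            counts[(kept, key)] += 1
    return counts
```

Two sequences' kernel value is then a dot product over these counts, so the whole computation stays in hash-table operations rather than walking a trie.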
We introduce RNA2DNAlign, a computational framework for quantitative assessment of allele counts across paired RNA and DNA sequencing datasets. RNA2DNAlign is based on quantitation of the relative abundance of variant and reference read counts, followed by binomial tests for genotype and allelic status at SNV positions between compatible sequences. RNA2DNAlign detects positions with differential allele distribution, suggesting asymmetries due to regulatory/structural events. Based on the type of asymmetry, RNA2DNAlign outlines positions likely to be implicated in RNA editing, allele-specific expression or loss, somatic mutagenesis or loss-of-heterozygosity (the first three also in a tumor-specific setting). We applied RNA2DNAlign to 360 matching normal and tumor exomes and transcriptomes from 90 breast cancer patients from TCGA. Under high-confidence settings, RNA2DNAlign identified 2038 distinct SNV sites associated with one of the aforementioned asymmetries, the majority of which have not previously been linked to functionality. The performance assessment shows very high specificity and sensitivity, due to the corroboration of signals across multiple matching datasets. RNA2DNAlign is freely available from http://github.com/HorvathLab/NGS as a self-contained binary package for 64-bit Linux systems.
Movassagh, Mercedeh, Nawaf Alomran, Prakriti Mudvari, Merve Dede, Cem Dede, Kamran Kowsari, Paula Restrepo et al. "RNA2DNAlign: nucleotide resolution allele asymmetries through quantitative assessment of RNA and DNA paired sequencing data." Nucleic Acids Research (2016): gkw757.
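The binomial testing the abstract mentions can be illustrated with an exact two-sided binomial test on variant vs. reference read counts (a generic sketch, not RNA2DNAlign's exact statistical procedure):

```python
import math

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test p-value: the probability, under
    Binomial(n, p), of any outcome no more likely than observing k
    successes. Under p=0.5 this tests whether variant and reference
    reads at an SNV are balanced."""
    pmf = [math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k] + 1e-12)
```

A small p-value at an SNV flags a skewed allele distribution, the kind of asymmetry the framework attributes to events like allele-specific expression or RNA editing.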
This paper introduces a novel weighted unsupervised learning method for object detection using an RGB-D camera. The technique is feasible for detecting moving objects in noisy environments captured by an RGB-D camera. The main contribution of this paper is a real-time algorithm that detects each object as a separate cluster using weighted clustering. In a preprocessing step, the algorithm calculates the 3D position (X, Y, Z) and RGB color of each data point, and then calculates each data point's normal vector using the point's neighbors. After preprocessing, the algorithm calculates k weights for each data point; each weight indicates cluster membership, resulting in the clustered objects of the scene.
Kamran Kowsari and Manal H. Alassaf, “Weighted Unsupervised Learning for 3D Object Detection” International Journal of Advanced Computer Science and Applications(IJACSA), 7(1), 2016. http://dx.doi.org/10.14569/IJACSA.2016.070180
Searching through a large volume of data is critical for companies, scientists, and search-engine applications due to time and memory complexity. In this paper, a new technique for generating a FuzzyFind Dictionary for text mining is introduced. Words are mapped to 23-bit binary vectors reflecting the presence or absence of particular letters of the English alphabet; representations longer than 23 bits can be handled by using additional FuzzyFind Dictionaries. This representation preserves closeness of word distortions in terms of closeness of the created binary vectors, within a Hamming distance of 2 deviations. This paper describes the Golay Coding Transformation Hash Table and how it can be used with a FuzzyFind Dictionary as a new technique for searching through big data. The method generates the dictionary in linear time and accesses the data in constant time, while updating with new data sets takes time linear in the number of new data points. The technique operates on 23-bit segments derived from the letters of English words; longer representations can also be handled by using additional segments as reference tables.
Kamran Kowsari, Maryam Yammahi, Nima Bari, Roman Vichr, Faisal Alsaby and Simon Y. Berkovich, “Construction of FuzzyFind Dictionary using Golay Coding Transformation for Searching Applications” International Journal of Advanced Computer Science and Applications(ijacsa), 6(3), 2015. http://dx.doi.org/10.14569/IJACSA.2015.060313
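The presence/absence encoding the abstract describes can be sketched as follows; the particular 23-letter segment chosen here (a through w) is a hypothetical choice made for illustration:

```python
def letter_mask(word, alphabet="abcdefghijklmnopqrstuvw"):
    """Map a word to a 23-bit presence/absence vector over a fixed
    23-letter segment of the alphabet (an illustrative choice of the
    23 bits, not the paper's exact mapping)."""
    return tuple(1 if c in word.lower() else 0 for c in alphabet)

def hamming(u, v):
    """Number of positions where two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))
```

Words built from nearly the same letters land at small Hamming distances, which is the closeness-preserving property the Golay-code lookup then exploits.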
Big Data is the term for the exponential growth of data on the Internet. The importance of Big Data is not about how large it is, but about what information can be obtained from analyzing these data. Such analysis helps many businesses make smarter decisions and provides time and cost reductions. Performing such analysis requires searching large files, and Big Data is a setting in which sequential search is prohibitively inefficient in terms of time and energy. Therefore, any new technique that allows very efficient search in very large files is in high demand. This paper presents an innovative approach for efficient searching with fuzzy criteria in very large information systems (Big Data). Organizing efficient access to a large amount of information by an "approximate" or "fuzzy" indication is a rather complicated computer science problem. Usually, the solution relies on a brute-force approach, which results in a sequential look-up of the file and, in many cases, substantially undermines system performance. The technique suggested in this paper uses a different approach based on the Pigeonhole Principle. It searches for binary strings that approximately match a given request, substantially reducing sequential search operations and improving speed, cost, and energy by several orders of magnitude. The paper presents a fully developed scheme for the suggested approach using a new data structure, called the FuzzyFind Dictionary. The developed scheme provides more accuracy than the basic utilization of the suggested method, and it also works much faster than sequential search.
Yammahi, Maryam, Kamran Kowsari, Chen Shen, and Simon Berkovich. "An Efficient Technique for Searching Very Large Files with Fuzzy Criteria Using the Pigeonhole Principle." In Computing for Geospatial Research and Application (COM. Geo), 2014 Fifth International Conference on, pp. 82-86. IEEE, 2014. http://dx.doi.org/10.1109/COM.Geo.2014.8
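The Pigeonhole Principle filter at the heart of the approach can be sketched as follows: if a query is split into max_mismatches + 1 segments, any record within max_mismatches of it must match at least one segment exactly, so exact segment matches prune the candidate set (records are assumed to be the same length as the query):

```python
def pigeonhole_candidates(query, records, max_mismatches):
    """Pigeonhole filter for approximate string search: split the query
    into (max_mismatches + 1) segments; any record within
    max_mismatches of the query must agree with at least one segment
    exactly at its position, so only such records survive."""
    k = max_mismatches + 1
    seg_len = len(query) // k
    candidates = []
    for rec in records:
        for i in range(k):
            lo = i * seg_len
            hi = (i + 1) * seg_len if i < k - 1 else len(query)
            if rec[lo:hi] == query[lo:hi]:  # exact match on one segment
                candidates.append(rec)
                break
    return candidates
```

The surviving candidates (typically a tiny fraction of the file) are then verified with a full comparison, replacing the brute-force sequential scan.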
The evolution and affordability of depth cameras like the Microsoft Kinect make them a great source for object detection and surveillance monitoring. Information available from depth cameras includes depth in addition to color. Using depth cameras, the provided depth information can be incorporated for object detection in still and video images, but special care is needed to pair it with color information. In this work, we propose a simple, yet novel, real-time unsupervised object detection method for spatio-temporal videos. The RGB color frame is mapped into the Hunter Lab color space to reduce the emphasis on image illumination, while the depth frame is back-projected into 3D real-world coordinates in order to distinguish between objects in space. Once combined, the mapped color information and the back-projected depth information are fed into an automatic, unsupervised clustering framework in order to detect scene objects. The framework runs in parallel to provide real-time spatio-temporal object detection.
Alassaf, Manal H., Kamran Kowsari, and James K. Hahn. "Automatic, Real Time, Unsupervised Spatio-temporal 3D Object Detection Using RGB-D Cameras." In Information Visualisation (iV), 2015 19th International Conference on, pp. 444-449. IEEE, 2015. http://dx.doi.org/10.1109/iV.2015.80
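The back-projection of a depth pixel into 3D real-world coordinates uses the standard pinhole camera model; the intrinsic parameters in the example are illustrative, not the Kinect's actual calibration:

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into 3D camera
    coordinates using the pinhole model: fx, fy are focal lengths in
    pixels and (cx, cy) is the principal point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

Points that sit close together in this 3D space belong to the same physical object even when they are adjacent in the image to something at a different depth, which is what makes the subsequent clustering separate objects in space.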
The global influence of Big Data is not only growing but seemingly endless. The trend is leaning towards knowledge that is attained easily and quickly from massive pools of Big Data. Today we are living in the technological world that Dr. Usama Fayyad and his distinguished research fellows predicted nearly two decades ago in their introductory explanations of Knowledge Discovery in Databases (KDD). Indeed, they were precise in their outlook on Big Data analytics. The continued improvement of the interoperability of machine learning, statistics, and database building and querying has fused to create this increasingly popular science: Data Mining and Knowledge Discovery. The next generation of computational theories is geared towards helping to extract insightful knowledge from even larger volumes of data at higher rates of speed. As the trend increases in popularity, a highly adaptive solution for knowledge discovery will be necessary. In this research paper, we introduce the investigation and development of 23 bit-questions for a metaknowledge template for Big Data processing and clustering purposes. This research aims to demonstrate the construction of this methodology and to show its validity and the benefits it brings to knowledge discovery from Big Data.
Bari, Nima, Roman Vichr, Kamran Kowsari, and Simon Berkovich. "23-bit metaknowledge template towards big data knowledge discovery and management." In Data Science and Advanced Analytics (DSAA), 2014 International Conference on, pp. 519-526. IEEE, 2014. http://dx.doi.org/10.1109/DSAA.2014.7058121
Past research has challenged us with the task of showing relational patterns between text-based data and then clustering for predictive analysis using the Golay Code technique. We focus on a novel approach to extract metaknowledge from multimedia datasets. Our collaboration has been an ongoing task of studying the relational patterns between data points based on meta-features extracted from metaknowledge in multimedia datasets. The features selected are suited to the mining technique we applied, the Golay Code algorithm. In this research paper we summarize findings on optimizing the metaknowledge representation as 23-bit representations of structured and unstructured multimedia data, so that they can be processed by the 23-bit Golay Code for cluster recognition.
Bari, Nima, Roman Vichr, Kamran Kowsari, and Simon Y. Berkovich. "Novel Metaknowledge-Based Processing Technique for Multimedia Big Data Clustering Challenges." In Multimedia Big Data (BigMM), 2015 IEEE International Conference on, pp. 204-207. IEEE, 2015. http://dx.doi.org/10.1109/BigMM.2015.78
Integrative Next Generation Sequencing (NGS) DNA and RNA analyses have very recently become feasible, and the studies published to date have discovered critical disease-implicated pathways as well as diagnostic and therapeutic targets. A growing number of exomes, genomes, and transcriptomes from the same individual are quickly accumulating, providing unique venues for mechanistic and regulatory feature analysis and, at the same time, requiring new exploration strategies. In this study, we have integrated variation and expression information from four NGS datasets from the same individual: normal and tumor breast exomes and transcriptomes. Focusing on SNP-centered variant allelic prevalence, we illustrate analytical algorithms that can be applied to extract or validate potential regulatory elements, such as expression or growth advantage, imprinting, loss of heterozygosity (LOH), somatic changes, and RNA editing. In addition, we point to some critical elements that might bias the output and recommend alternative measures to maximize the confidence of findings. The need for such strategies is especially recognized within the growing appreciation of the concept of systems biology: integrative exploration of genome and transcriptome features reveals mechanistic and regulatory insights that reach far beyond a linear addition of the individual datasets.
Mudvari, Prakriti, Kamran Kowsari, Charles Cole, Raja Mazumder, and Anelia Horvath. "Extraction of molecular features through exome to transcriptome alignment." Journal of metabolomics and systems biology 1, no. 1 (2013).
We have developed a computational approach, called SNPlice, for identifying cis-acting, splice-modulating variants from RNA-seq datasets. SNPlice mines RNA-seq datasets to find reads that span single-nucleotide variant (SNV) loci and nearby splice junctions, assessing the co-occurrence of variants and molecules that remain unspliced at nearby exon–intron boundaries. Hence, SNPlice highlights variants preferentially occurring on intron-containing molecules, possibly resulting from altered splicing. To illustrate co-occurrence of variant nucleotide and exon–intron boundary, allele-specific sequencing was used. SNPlice results are generally consistent with splice-prediction tools, but also indicate splice-modulating elements missed by other algorithms. SNPlice can be applied to identify variants that correlate with unexpected splicing events, and to measure the splice-modulating potential of canonical splice-site SNVs.
Mudvari, Prakriti, Mercedeh Movassagh, Kamran Kowsari, Ali Seyfi, Maria Kokkinaki, Nathan J. Edwards, Nady Golestaneh, and Anelia Horvath. "SNPlice: variants that modulate Intron retention from RNA-sequencing data." Bioinformatics (2014): btu804.
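The co-occurrence idea behind SNPlice, sketched loosely (not the tool's actual implementation), compares how often variant-carrying reads remain unspliced at a nearby exon-intron boundary versus reference reads:

```python
def intron_retention_rates(reads):
    """Given reads that span both an SNV locus and a nearby exon-intron
    boundary, compare how often variant vs. reference reads remain
    unspliced. Each read is a (carries_variant, is_spliced) pair of
    booleans; returns (variant_unspliced_rate, reference_unspliced_rate)."""
    def rate(group):
        unspliced = sum(1 for _, spliced in group if not spliced)
        return unspliced / len(group) if group else 0.0
    variant = [r for r in reads if r[0]]
    reference = [r for r in reads if not r[0]]
    return rate(variant), rate(reference)
```

A variant rate much higher than the reference rate is the signature of a splice-modulating variant: the intron is preferentially retained on molecules carrying the variant allele.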