Chinese Journal of Tissue Engineering Research (中国组织工程研究), 2010, Vol. 14, Issue 17: 3099-3133. doi: 10.3969/j.issn.1673-8225.2010.17.015

• Digital orthopedics •

Genetic algorithm and support vector machine-based gene microarray analysis

Wang Wei, Liu Hong

  1. Capital Medical University, Beijing 100069, China
  • Online: 2010-04-23 Published: 2010-04-23
  • Contact: Liu Hong, Capital Medical University, Beijing 100069, China
  • About author: Wang Wei★, male, born in 1975 in Beijing, Han nationality, master's degree candidate at Capital Medical University and senior engineer, mainly engaged in research on biomedical information processing technology. wwei_mail@163.com
  • Supported by:

    the Science and Technology Development General Program of Beijing Education Commission, No. KM200910025006

Abstract:

BACKGROUND: Gene microarray data combine a small sample size with a very large number of variables, often more than ten thousand genes. At this dimensionality, traditional statistical methods frequently fail. Genetic algorithms (GA) and support vector machines (SVM) are machine learning methods that have developed rapidly in recent years and that offer good classification performance and advantages in dimensionality reduction.
OBJECTIVE: To combine GA with SVM for sample classification, and to compare the result with two alternatives: SVM applied directly to all genes, and SVM applied after screening for differentially expressed genes.
METHODS: We used the golub data set provided by Bioconductor, a gene expression data set obtained from leukaemia microarray experiments. First, all genes were used to classify the samples with SVM. Second, significance analysis of the microarray data with the SAM software was used to identify differentially expressed genes and to estimate the false discovery rate (FDR); the 76 differentially expressed genes selected in this way served as the feature gene subset for SVM classification. Third, the same 76 genes were taken as the initial feature gene set, and GA-SVM was used to select feature genes again in order to improve classification accuracy. The three classification results were compared. The distribution and function of the feature genes in KEGG pathways were also examined.
RESULTS AND CONCLUSION: Dimensionality reduction with the genetic algorithm improved the classification accuracy of SVM, in particular because feature selection removed a large number of irrelevant genes and noise from the data. The results show that combining GA with SVM is more effective for classification. In addition, pathway analysis shows that the main functions of the feature genes are signal transduction and amino acid metabolism.

Key words: genetic algorithm, support vector machine, microarray, pathway, digital medicine
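
The GA-SVM step described in METHODS can be illustrated with a minimal sketch. The abstract does not give the authors' implementation, so everything below is an assumption made for illustration: the function names (`train_linear_svm`, `loo_accuracy`, `ga_select`, `make_toy_data`) are hypothetical, the SVM is a simple linear one trained by sub-gradient descent on the hinge loss, the fitness is leave-one-out accuracy, and the data are synthetic rather than the golub data set. Each GA individual is a 0/1 mask over the candidate genes, and its fitness is the accuracy of an SVM trained on the selected genes only.

```python
import random


def train_linear_svm(X, y, lam=0.01, eta=0.05, epochs=10):
    """Linear SVM via sub-gradient descent on the L2-regularised hinge loss.

    Labels must be -1 or +1. This is a stand-in for a library SVM.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1 - eta * lam) for wj in w]  # shrink (regularisation)
            if margin < 1:  # sample violates the margin: move towards it
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def loo_accuracy(X, y, mask):
    """Leave-one-out accuracy of the SVM using only features where mask == 1."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    Xs = [[row[j] for j in cols] for row in X]
    hits = 0
    for i in range(len(Xs)):
        w, b = train_linear_svm(Xs[:i] + Xs[i + 1:], y[:i] + y[i + 1:])
        score = sum(wj * xj for wj, xj in zip(w, Xs[i])) + b
        hits += (1 if score >= 0 else -1) == y[i]
    return hits / len(Xs)


def ga_select(X, y, pop_size=10, generations=5, p_mut=0.05, seed=0):
    """Genetic algorithm over 0/1 feature masks with SVM accuracy as fitness."""
    rng = random.Random(seed)
    n = len(X[0])
    # Seed the population with the "keep every gene" mask plus random masks.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    cache = {}

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = loo_accuracy(X, y, mask)
        return cache[key]

    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        nxt = ranked[:2]  # elitism: the two best masks survive unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(ranked, 3), key=fitness)  # tournament selection
            b = max(rng.sample(ranked, 3), key=fitness)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fitness)
    return best, fitness(best)


def make_toy_data(n_samples=16, n_features=8, n_informative=2, seed=1):
    """Synthetic two-class data: only the first n_informative features matter."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = 1 if i % 2 == 0 else -1
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_data()
baseline = loo_accuracy(X, y, [1] * len(X[0]))  # SVM on all features
best_mask, best_acc = ga_select(X, y)           # SVM on the GA-selected subset
print("all features:", baseline)
print("GA subset:", best_mask, best_acc)
```

Because the initial population contains the all-features mask and the two best masks are always carried forward, the accuracy of the selected subset can never fall below the all-features baseline on the data used for selection. That accuracy is optimistic, since the same samples drive both selection and evaluation; a held-out test set would be needed for an unbiased estimate. In practice a library SVM (for example scikit-learn's `SVC`) would likely replace the hand-written trainer.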

CLC number: