Journal of Chongqing University of Technology (Natural Science) ›› 2023, Vol. 37 ›› Issue (9): 208-216.

• Information · Computer Science •

Research on Construction and Optimization Methods for a Heterogeneous Distributed Deep Learning Platform

Hu Changxiu (胡昌秀), Zhang Yangsen (张仰森), Peng Shuang (彭爽), Chen Han (陈涵)

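  1. Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100101; 2. Beijing Laboratory of National Economic Security Early-Warning Engineering, Beijing 100044; 3. School of Literature, Northeast Normal University, Changchun 130022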
  • Publication date: 2023-10-17 Release date: 2023-10-17
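  • About the authors: Hu Changxiu, male, master's degree, mainly engaged in research on natural language processing and network content security, Email: 1091173065@qq.com; corresponding author Zhang Yangsen, male, Ph.D., professor, mainly engaged in research on Chinese information processing, network content security and filtering, intelligent warehousing and logistics, and data mining, Email: zhangyangsen@163.com.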

Research on the construction and optimization of heterogeneous distributed deep learning platform

  • Online:2023-10-17 Published:2023-10-17

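Abstract: The combination of deep learning and big data technology still faces many problems in resource management, task scheduling, and related areas that remain to be solved and optimized. To address three problems (weak management of heterogeneous resources, poor flexibility of the native scheduling algorithm, and the lack of a unified interface across multiple frameworks), an integration platform for distributed deep learning frameworks on heterogeneous resources is proposed, and the optimization of the task scheduling algorithm is studied. The platform is based on the Spark framework: downward, it extends to and manages heterogeneous resources; upward, it integrates two frameworks, SparkOnAngel and TensorFlowOnSpark. Using a physical labeling method, machines that mount different computing resources are given different labels, and the scheduling algorithm is optimized with the help of a dual representation of the resource model. The results show that, compared with a traditional Spark cluster, the platform reduces execution time by 13.4% in a mixed scenario of 5 minist_spark tasks and 5 WordCount tasks; in a large-batch WordCount scenario, when the number of jobs reaches 60, execution time can be reduced to 32.31%. The platform extends management to GPU resources, its scheduling algorithm is more flexible and efficient, and it provides a unified calling interface for multiple frameworks.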

Keywords: heterogeneous, scheduling algorithm, resource management, unified interface

Abstract: Combining deep learning with big data technology is the general trend, yet many problems in resource management and task scheduling remain to be solved and optimized. To address three problems, namely weak management of heterogeneous resources, inflexible native scheduling algorithms, and the lack of a unified interface across frameworks, a platform that integrates distributed deep learning frameworks on heterogeneous resources is proposed, and the optimization of its task scheduling algorithm is studied. Built on the Spark framework, the platform extends and manages heterogeneous resources downward and integrates the SparkOnAngel and TensorFlowOnSpark frameworks upward. It uses physical labeling to tag machines according to the computing resources they mount, and it optimizes the scheduling algorithm with the help of a dual representation of the resource model. The results show that, compared with a traditional Spark cluster, the platform reduces execution time by 13.4% in a mixed scenario of 5 minist_spark tasks and 5 WordCount tasks; in a large-batch WordCount scenario, execution time can be reduced to 32.31% when the number of jobs reaches 60. The platform extends management to GPU resources, makes the scheduling algorithm more flexible and efficient, and provides a unified calling interface for multiple frameworks.
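
The labeling idea in the abstract can be illustrated with a small sketch. This is not the paper's scheduling algorithm, which the abstract does not specify; it is a hypothetical greedy placement that assumes each machine carries a set of resource labels and a count of free slots, and that each task names the label it requires. The names `Node`, `Task`, `schedule`, and all worker and task names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A worker machine; `labels` records which compute resources it mounts (assumed model)."""
    name: str
    labels: set
    free_slots: int


@dataclass
class Task:
    """A job that may only run on nodes carrying `required_label` (assumed model)."""
    name: str
    required_label: str


def schedule(tasks, nodes):
    """Greedy placement: each task goes to the matching node with the most free slots."""
    placement = {}
    for task in tasks:
        candidates = [
            n for n in nodes
            if task.required_label in n.labels and n.free_slots > 0
        ]
        if not candidates:
            placement[task.name] = None  # no labeled node has capacity, so the task waits
            continue
        best = max(candidates, key=lambda n: n.free_slots)
        best.free_slots -= 1  # reserve one slot on the chosen node
        placement[task.name] = best.name
    return placement


# Hypothetical cluster: one CPU-only machine and one machine that also mounts a GPU.
nodes = [
    Node("worker-1", {"cpu"}, free_slots=2),
    Node("worker-2", {"cpu", "gpu"}, free_slots=1),
]
tasks = [
    Task("mnist_training", "gpu"),
    Task("wordcount-1", "cpu"),
    Task("wordcount-2", "cpu"),
    Task("wordcount-3", "cpu"),
]
placement = schedule(tasks, nodes)
```

In this toy run the GPU task is placed on the only GPU-labeled machine, the first two WordCount tasks fill the CPU-only machine, and the third waits because no labeled node has a free slot. A real scheduler would also need queueing, slot release, and resource accounting, none of which is modeled here.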

CLC number:

  • TP393