
GA黄金甲 Networks' High-Performance Network Solution Opens Up the "Ren and Du Meridians" for AIGC

GA黄金甲 (China Group) official website · Published: 2023-03-20

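Introduction

AIGC (AI-Generated Content) has developed rapidly in recent months, and its pace of iteration is growing exponentially. The launches of GPT-4 and ERNIE Bot (文心一言) in particular have drawn intense attention to its commercial value and application scenarios. As AIGC develops, the parameter counts of trained models are growing from hundreds of billions to trillions, and the underlying GPU clusters have reached the scale of ten thousand cards. Network scale keeps growing as a result, and communication between network nodes faces ever greater challenges. Against this background, how to raise both the computing power of AI servers and the communication capability of the network while keeping cost under control has become one of the important research directions in artificial intelligence.

Based on the relationship among AIGC computing power, GPU utilization, and the network, and on the challenges facing mainstream HPC networking, GA黄金甲 Networks has launched the industry-leading "智快" DDC (Distributed Disaggregated Chassis) high-performance network solution, which opens up the "Ren and Du meridians" for AIGC services and helps computing power advance by leaps and bounds.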

Figure: Connection diagram of GA黄金甲 Networks DDC products

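The relationship among AIGC computing power, GPU utilization, and the network

ChatGPT training time and GPU utilization

Take ChatGPT as an example. It was trained on Microsoft's Azure AI supercomputing infrastructure (a high-bandwidth cluster of 10,000 V100 GPUs), and total compute consumption was about 3640 PF-days (that is, one quadrillion operations per second sustained for 3640 days). A simple conversion shows how long 10,000 V100s need to train: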

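Table: ChatGPT computing power and training time

Note: The ChatGPT compute requirement was obtained from online sources and is for reference only. OpenAI assumes a utilization of 33% in its article "AI and Compute". A group of researchers from NVIDIA, Stanford, and Microsoft achieved utilization of 44% to 52% when training large language models on distributed systems.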
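The conversion described above can be sketched as follows. This is a minimal illustration rather than the article's own table: the per-GPU peak of 125 TFLOPS (the V100 Tensor Core peak) is an assumption that the text does not state, and the function name `training_days` is hypothetical.

```python
def training_days(total_pf_days: float, num_gpus: int,
                  peak_tflops_per_gpu: float, utilization: float) -> float:
    """Days needed to deliver `total_pf_days` of compute on a GPU cluster.

    One PF-day is one PFLOPS sustained for one day, so dividing the total
    by the sustained cluster throughput (in PFLOPS) gives days.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Sustained cluster throughput in PFLOPS (1 PFLOPS = 1000 TFLOPS).
    sustained_pflops = num_gpus * peak_tflops_per_gpu / 1000 * utilization
    return total_pf_days / sustained_pflops


if __name__ == "__main__":
    # 3640 PF-days and 10,000 GPUs come from the text; 125 TFLOPS is assumed.
    for util in (0.33, 0.44, 0.52):
        days = training_days(3640, 10_000, 125.0, util)
        print(f"utilization {util:.0%}: {days:.1f} days")
```

Under these assumptions the training time is inversely proportional to utilization, which is why the rest of the article concentrates on what limits it.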

Figure: ChatGPT's reply about its training time

ChatGPT's reply is fairly consistent with the time calculated in the table above, which suggests a utilization of around 50%.

It follows that the main factors in a model's training time are GPU utilization and the processing capacity of the GPU cluster, and both depend closely on network efficiency. In an AI cluster, GPUs are usually the core resource of the compute nodes because they handle large-scale deep learning tasks efficiently. GPU utilization is affected by several factors, however, and network efficiency is a key one.

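Network efficiency and GPU utilization

The network plays a crucial role in AI training. An AI cluster usually consists of many compute nodes and storage nodes that must communicate and exchange data frequently. If the network is inefficient, communication between these nodes slows down, which directly reduces the computing power of the AI cluster.

An inefficient network can cause the following problems, each of which lowers GPU utilization:

Longer data transfer time: in an inefficient network, data takes longer to transfer. When a GPU has to wait for a transfer to finish before it can compute, its utilization drops;

Network bandwidth bottleneck: in an AI cluster, GPUs usually need to exchange data with other compute nodes frequently. If network bandwidth is insufficient, a GPU cannot get enough data to compute on, and its utilization drops;

Unbalanced task scheduling: in an inefficient network, a task may be assigned to a compute node other than the one holding the GPU. When large amounts of data must then be transferred, the GPU may sit idle waiting, which lowers its utilization.

Raising GPU utilization therefore requires optimizing network efficiency, which can be done by adopting faster network technologies, optimizing the network topology, and allocating bandwidth sensibly. In model training, the forms of parallelism used in distributed training (data parallelism, tensor parallelism, and pipeline parallelism) determine the communication pattern among the data processed by the GPUs. Communication efficiency is affected by the following factors: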

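Figure: Factors that affect communication

Of these, bandwidth and device forwarding latency are limited by hardware; end-host processing latency depends on the choice of technology (TCP or RDMA), with RDMA being lower; queuing and retransmission are affected by network optimization and technology choice.

According to the quantitative model [1], GPU utilization = in-GPU iteration compute time / (in-GPU iteration compute time + overall network communication time), which leads to the following conclusions: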

Figures: GPU utilization versus bandwidth throughput; GPU utilization versus dynamic latency

Network bandwidth throughput and dynamic latency (congestion and packet loss) both have a significant effect on GPU utilization.
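The quantitative model quoted above is simple enough to evaluate directly. The sketch below only restates that formula; the 100 ms compute time per iteration and the swept communication times are hypothetical values chosen for illustration, not measurements from the article.

```python
def gpu_utilization(compute_time: float, comm_time: float) -> float:
    """GPU utilization = compute time / (compute time + communication time)."""
    if compute_time <= 0 or comm_time < 0:
        raise ValueError("compute_time must be > 0 and comm_time >= 0")
    return compute_time / (compute_time + comm_time)


if __name__ == "__main__":
    compute_ms = 100.0  # hypothetical in-GPU compute time per iteration
    for comm_ms in (0.0, 10.0, 25.0, 50.0, 100.0):
        util = gpu_utilization(compute_ms, comm_ms)
        print(f"communication {comm_ms:5.1f} ms -> utilization {util:.1%}")
```

Any extra communication time, whether it comes from lower throughput or from added dynamic latency, lands in the denominator and lowers utilization.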

Looking at what makes up total communication latency:

Figure: Composition of total communication latency

Static latency has a comparatively small effect, so the focus should be on reducing dynamic latency. Doing so effectively raises GPU utilization and therefore computing power.
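The split between static and dynamic latency can be sketched as a small data structure. The grouping used here is an assumption based on the factors listed earlier in the article (serialization on the link, device forwarding, and end-host processing as static; queuing and retransmission as dynamic), and every name and number in the example is hypothetical.

```python
from dataclasses import dataclass


def serialization_us(message_bytes: int, link_gbps: float) -> float:
    """Time in microseconds to put `message_bytes` on a link of `link_gbps`."""
    return message_bytes * 8 / (link_gbps * 1e3)


@dataclass
class LatencyBreakdown:
    serialization_us: float    # static: depends on message size and bandwidth
    forwarding_us: float       # static: device forwarding latency
    end_processing_us: float   # static: host stack (TCP or RDMA)
    queuing_us: float          # dynamic: congestion in switch buffers
    retransmission_us: float   # dynamic: recovery after packet loss

    @property
    def static_us(self) -> float:
        return self.serialization_us + self.forwarding_us + self.end_processing_us

    @property
    def dynamic_us(self) -> float:
        return self.queuing_us + self.retransmission_us

    @property
    def total_us(self) -> float:
        return self.static_us + self.dynamic_us


if __name__ == "__main__":
    # Hypothetical numbers: a 1 MB message on a 200G link, lightly congested.
    b = LatencyBreakdown(serialization_us(1_000_000, 200), 3.0, 2.0, 10.0, 0.0)
    print(f"static {b.static_us} us, dynamic {b.dynamic_us} us, total {b.total_us} us")
```

The static terms are fixed by the hardware and the chosen transport, whereas the dynamic terms vary with load, which is why they are the ones worth engineering down.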

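Challenges facing mainstream HPC networking

InfiniBand networking is expensive and closed

InfiniBand networking is currently the best-performing option for high-performance networks. It uses very high bandwidth and a credit-based mechanism to ensure congestion-free operation and ultra-low latency, but it is also the most expensive option, costing several times as much as a traditional Ethernet network of the same bandwidth. InfiniBand is also a closed technology with only one mature supplier in the industry, so end users have no second source.

Most users in the industry therefore choose a traditional Ethernet networking solution.

PFC and ECN may trigger rate reduction

Today's mainstream high-performance networking solution builds an RDMA-capable network on RoCE v2. Two important companion technologies are PFC and ECN, both of which were created to avoid congestion on links.

In multi-tier PFC networking, congestion at a switch ingress is back-pressured hop by hop until the source server pauses sending, which relieves congestion and avoids packet loss. In a multi-tier network, however, this scheme carries the risk of PFC deadlock, which stops RDMA traffic from being forwarded.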

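Figure: How PFC works

ECN, by contrast, relies on the destination end sensing congestion at a switch egress. The destination directly generates a RoCEv2 CNP packet to notify the source to slow down. On receiving the CNP, the source server precisely reduces the sending rate of the corresponding QP, relieving congestion while avoiding indiscriminate rate reduction.

Figure: ECN flag bits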
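The per-QP behaviour described above can be illustrated with a toy rate limiter. This is only a sketch of the idea that a CNP slows the one queue pair it refers to: the class name, the multiplicative cut factor of 0.5, and the minimum rate are all hypothetical, and the congestion control algorithms on real NICs are considerably more elaborate.

```python
class QpRateLimiter:
    """Toy model: each queue pair (QP) has its own sending rate in Gbps."""

    def __init__(self, line_rate_gbps: float, cut_factor: float = 0.5,
                 min_rate_gbps: float = 1.0) -> None:
        self.line_rate_gbps = line_rate_gbps
        self.cut_factor = cut_factor        # hypothetical multiplicative decrease
        self.min_rate_gbps = min_rate_gbps
        self._rates: dict = {}              # QP number -> current rate

    def rate(self, qp: int) -> float:
        """Current rate of a QP; QPs that never saw a CNP run at line rate."""
        return self._rates.get(qp, self.line_rate_gbps)

    def on_cnp(self, qp: int) -> float:
        """Handle a CNP for one QP: reduce only that QP's rate."""
        new_rate = max(self.min_rate_gbps, self.rate(qp) * self.cut_factor)
        self._rates[qp] = new_rate
        return new_rate


if __name__ == "__main__":
    limiter = QpRateLimiter(line_rate_gbps=200.0)
    limiter.on_cnp(7)  # congestion reported for QP 7 only
    print(limiter.rate(7), limiter.rate(8))
```

Only the congested QP is slowed while the others keep their rate, which is the contrast with PFC, where a pause frame stops all traffic of a priority class on the link.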

Neither technology is a problem in itself; both were created to resolve congestion. Once adopted, however, they may be triggered frequently by congestion arising in the network. The source then pauses or slows its sending, communication bandwidth drops, GPU utilization suffers noticeably, and the computing power of the whole high-performance network is dragged down.

ECMP imbalance may cause congestion

AI training involves two main communication patterns, All-Reduce and All-to-All, both of which require frequent communication from one GPU to multiple other GPUs.

Figures: All-to-All pattern; All-Reduce pattern

In traditional networking, ToR and Leaf devices use routing plus ECMP. ECMP selects paths by hashing per flow, so in an extreme case a single elephant flow saturates one ECMP link while the other ECMP links sit relatively idle, producing uneven load.

Figure: Traditional ECMP deployment

In an internal test environment simulating 8 ECMP links, the results were as follows:

Figure: ECMP traffic test results

Flow-based ECMP clearly leaves some links heavily loaded (ECMP1-5 and 1-6) and others relatively idle (ECMP1-0 through 1-3). Under the All-Reduce and All-to-All patterns, a path can easily become congested because of uneven ECMP load. Once congestion causes retransmission, total communication latency rises and GPU utilization falls.
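The imbalance described above follows directly from hashing whole flows onto links. The sketch below is a simplified model, not the test setup from the article: it uses CRC32 over a 5-tuple as a stand-in for a switch's hash function, and the flows and their sizes are hypothetical.

```python
import zlib


def ecmp_link(five_tuple: tuple, num_links: int) -> int:
    """Pick a link by hashing the flow's 5-tuple (CRC32 is a stand-in hash)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % num_links


def link_loads(flows: list, num_links: int) -> list:
    """Sum the bytes of every flow onto the single link its hash selects."""
    loads = [0] * num_links
    for five_tuple, size_bytes in flows:
        loads[ecmp_link(five_tuple, num_links)] += size_bytes
    return loads


if __name__ == "__main__":
    # Hypothetical traffic: a few large RDMA flows (UDP port 4791 is RoCEv2).
    flows = [
        (("10.0.0.1", "10.0.1.1", 49152, 4791, "udp"), 8_000_000_000),
        (("10.0.0.2", "10.0.1.2", 49153, 4791, "udp"), 8_000_000_000),
        (("10.0.0.3", "10.0.1.3", 49154, 4791, "udp"), 8_000_000_000),
        (("10.0.0.4", "10.0.1.4", 49155, 4791, "udp"), 8_000_000_000),
    ]
    for link, load in enumerate(link_loads(flows, 8)):
        print(f"ECMP link {link}: {load / 1e9:.1f} GB")
```

With only a handful of large flows, every flow sits entirely on one link, so at best four of the eight links carry traffic, and any hash collision doubles the load on one of them.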

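To address such problems, the research community has proposed many solutions, including pHost, Homa, NDP, 1RMA, and Aeolus. To varying degrees they address incast, load balancing, and low-latency request/response traffic. They also bring new challenges: these proposals generally solve the problem end to end and require substantial changes to hosts, NICs, and the network, which is costly for ordinary users.

Challenges of building AI clusters with chassis switches

Some overseas internet companies hope to use chassis switches built on DNX chips with VOQ support to solve the low bandwidth utilization caused by uneven load, but this approach faces the following challenges.

Limited scalability: the size of the chassis caps the maximum number of ports. Building a larger cluster requires scaling out across multiple chassis, which again produces multi-tier PFC and ECMP links, so a chassis is only suitable for small-scale deployment;

High power consumption: a chassis contains many line card chips, fabric chips, and fans, so a single device easily draws more than 20,000 W, and some draw more than 30,000 W, which places heavy demands on rack power;

Large fault domain: a single device has a large number of ports.

For these reasons, chassis devices are only suitable for small-scale AI computing clusters.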

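A new form factor: DDC products for AIGC high-performance networks

DDC is a solution that disaggregates the chassis into distributed devices. It uses almost the same chips and key technologies as a traditional chassis switch, but the DDC architecture is simple, supports elastic expansion and rapid feature iteration, is easier to deploy, and consumes less power per device.

As shown in the figure below, the service line cards become the front end in the NCP role, and the switch fabric cards become the back end in the NCF role. The connector components that used to link the two are replaced by optical fiber cables, and the management engine of the chassis becomes the NCC, a centralized or distributed management component in the DDC architecture.

Figure: Connection diagram of DDC products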

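DDC supports very large-scale deployment

The advantage of the DDC architecture over the chassis architecture is elastic scalability: the network scale can be chosen flexibly according to the size of the AI cluster.

In a single-POD network, 96 NCPs are used for access. Each NCP has 36 downlink 200G interfaces, which connect to the NICs of the AI computing cluster, and 40 uplink 200G interfaces, which can connect to up to 40 NCFs. Each NCF provides 96 200G interfaces. At this scale the uplink-to-downlink bandwidth has a speedup ratio of 1.1:1. The whole POD supports 3456 200G network interfaces, which at 8 GPUs per server corresponds to 432 AI computing servers.

Figure: Single-POD network architecture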
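The single-POD figures above can be reproduced with a few lines of arithmetic. This sketch assumes one 200G NIC port per GPU, which is implied by the article's division by 8 GPUs per server but not stated outright; the function and parameter names are hypothetical.

```python
def single_pod(ncp_count: int = 96, ncp_downlinks: int = 36,
               ncp_uplinks: int = 40, ports_per_server: int = 8) -> dict:
    """Scale of one DDC POD: server-facing ports, servers, and speedup ratio."""
    server_ports = ncp_count * ncp_downlinks
    return {
        "server_ports": server_ports,
        # Assumption: one 200G port per GPU, 8 GPUs per server.
        "servers": server_ports // ports_per_server,
        # Fabric-facing bandwidth relative to server-facing bandwidth per NCP.
        "speedup_ratio": ncp_uplinks / ncp_downlinks,
        # Each NCP sends one uplink to every NCF, so an NCF needs this many ports.
        "ncf_ports_needed": ncp_count,
    }


if __name__ == "__main__":
    pod = single_pod()
    print(pod["server_ports"], pod["servers"], round(pod["speedup_ratio"], 2))
```

The last field shows why 96 NCPs is the ceiling for a single POD: every NCP takes one port on each NCF, and an NCF has 96 ports.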

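In a multi-tier POD network, capacity can be built out POD by POD as needed. In this scenario the NCFs inside a POD must give up half of their SerDes to connect to the second-tier NCFs, so a single POD uses 48 NCPs for access, each with 36 downlink 200G interfaces, and supports 1728 200G interfaces. Capacity is expanded by adding PODs horizontally, and the network as a whole supports up to more than 10,368 200G network ports.

Each NCP uses its 40 uplink 200G interfaces to connect to the 40 NCFs inside the POD. Each NCF inside the POD uses 48 200G interfaces for downlinks, and its other 48 200G interfaces go up to the second-tier NCFs in groups of 16. The second tier is designed as 40 planes of 3 NCFs each, with each plane corresponding to one of the 40 NCFs inside a POD.

Inside each POD the network achieves a speedup ratio of 1.1:1, and between the PODs and the second-tier NCFs it achieves a 1:1 convergence ratio.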
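The two-tier numbers above fit together as shown in the sketch below. The POD count is derived, not quoted: it assumes that every second-tier NCF is the same 96-port device and dedicates one group of 16 ports to each POD, which the article does not state explicitly. All names are hypothetical.

```python
def two_tier_ddc(ncp_per_pod: int = 48, ncp_downlinks: int = 36,
                 ncf_ports: int = 96, uplink_group: int = 16,
                 planes: int = 40) -> dict:
    """Scale of a two-tier DDC network built from PODs and second-tier NCFs."""
    ports_per_pod = ncp_per_pod * ncp_downlinks
    # Half of each POD NCF's ports face the NCPs, the other half face tier 2.
    ncf_uplinks = ncf_ports - ncp_per_pod
    tier2_per_plane = ncf_uplinks // uplink_group
    # Assumption: a tier-2 NCF spends one group of `uplink_group` ports per POD.
    max_pods = ncf_ports // uplink_group
    return {
        "ports_per_pod": ports_per_pod,
        "tier2_ncf_per_plane": tier2_per_plane,
        "tier2_ncf_total": planes * tier2_per_plane,
        "max_pods": max_pods,
        "total_ports": max_pods * ports_per_pod,
        "pod_to_tier2_ratio": ncf_uplinks / ncp_per_pod,
    }


if __name__ == "__main__":
    for name, value in two_tier_ddc().items():
        print(f"{name}: {value}")
```

Under that assumption six PODs of 1728 ports each give the 10,368 ports quoted in the text, and the equal split of NCF ports between downlinks and uplinks gives the 1:1 ratio toward the second tier.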

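The 200G network ports are compatible with 100G NICs, and in special cases 1-to-2 or 1-to-4 breakout cables can be used to connect 25G/50G NICs.

More balanced load and lower packet loss with the VOQ + Cell mechanism

Dynamic load balancing based on forwarding packets as sliced Cells keeps latency stable and reduces the difference in peak bandwidth between links.

The forwarding process is as follows:

First, the sending end receives packets from the network and classifies them into VOQs for storage. Before sending the packets, it sends a Credit message to determine whether the receiving end has enough buffer space to handle them;

If it does, the packets are sliced into Cells and dynamically load-balanced across the intermediate Fabric nodes. The Cells are reassembled and stored at the receiving end and then forwarded on to the network.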
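The two steps above can be sketched as a toy ingress model: packets wait in a per-destination VOQ, and only a credit from the receiving end releases them and slices them into cells. This is an illustration of the described flow, not the DNX implementation; the class name, the byte-based credit, and the 256-byte cell size are assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class IngressVoq:
    """Toy sending end: one virtual output queue (VOQ) per egress port."""

    def __init__(self, cell_size: int = 256) -> None:
        self.cell_size = cell_size
        self.voqs: Dict[str, Deque[bytes]] = defaultdict(deque)

    def enqueue(self, egress: str, packet: bytes) -> None:
        """Classify an arriving packet into the VOQ of its egress port."""
        self.voqs[egress].append(packet)

    def on_credit(self, egress: str, credit_bytes: int) -> List[bytes]:
        """Release whole packets while the credit lasts, sliced into cells."""
        cells: List[bytes] = []
        queue = self.voqs[egress]
        while queue and len(queue[0]) <= credit_bytes:
            packet = queue.popleft()
            credit_bytes -= len(packet)
            for start in range(0, len(packet), self.cell_size):
                cells.append(packet[start:start + self.cell_size])
        return cells


def reassemble(cells: List[bytes]) -> bytes:
    """Receiving end: put the cells of a packet back together in order."""
    return b"".join(cells)


if __name__ == "__main__":
    voq = IngressVoq()
    voq.enqueue("port-1", b"a" * 600)
    voq.enqueue("port-1", b"b" * 600)
    cells = voq.on_credit("port-1", credit_bytes=1000)  # room for one packet
    print([len(c) for c in cells], len(voq.voqs["port-1"]))
```

The second packet stays in the VOQ until another credit arrives, so a busy receiver causes queuing at the sender instead of packet loss inside the fabric.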

Cells are produced by slicing packets; a cell is typically 64 to 256 bytes.

Each sliced Cell is forwarded according to a lookup of its cell destination in the reachability table, and Cells are sent in round-robin fashion. Compared with ECMP, which hashes each flow and then picks a single path, sliced Cells make full use of every uplink, so the amount of data carried on all uplinks is approximately equal.
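The effect of round-robin cell forwarding can be checked with a short simulation. This sketch models only the spraying step, with a fixed maximum cell size of 256 bytes and hypothetical packet sizes; the reachability-table lookup is left out.

```python
def spray_cells(packet_sizes: list, num_uplinks: int,
                cell_size: int = 256) -> list:
    """Slice each packet into cells and send the cells round-robin on uplinks.

    Returns the number of bytes carried by each uplink.
    """
    loads = [0] * num_uplinks
    link = 0
    for size in packet_sizes:
        remaining = size
        while remaining > 0:
            cell = min(cell_size, remaining)  # the last cell may be shorter
            loads[link] += cell
            link = (link + 1) % num_uplinks
            remaining -= cell
    return loads


if __name__ == "__main__":
    # Hypothetical mix: one very large message plus two small packets.
    loads = spray_cells([1_000_000, 1500, 64], num_uplinks=8)
    print(loads, "spread:", max(loads) - min(loads), "bytes")
```

Even with a single dominant message, the uplinks end up within a few hundred bytes of one another, whereas per-flow hashing would place the whole message on one link.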

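If the receiving end is temporarily unable to handle packets, they are held in the VOQ at the sending end and are not forwarded to the receiving end, where they would be dropped. Each DNX chip provides an on-chip OCB buffer plus 8 GB of off-chip HBM buffer, which for a 200G port is equivalent to buffering about 150 ms of data. Packets are sent only when a Credit message from the peer confirms that it can accept them. With this mechanism, making full use of the buffers greatly reduces packet loss and can eliminate it altogether. With fewer retransmissions, overall communication latency is lower and more stable, which raises bandwidth utilization and in turn service throughput.

No deadlock with single-hop PFC deployment

In DDC logic, all NCPs and NCFs can be treated as a single device. After an RDMA domain is deployed in this network, there is only one level of PFC, at the interfaces facing the servers, so the multi-tier PFC back-pressure and deadlock of a traditional network do not occur. In addition, given the DDC forwarding mechanism, ECN can be deployed at the interfaces: if the internal Credit and buffering mechanisms cannot absorb a traffic burst, a CNP can be sent to the server to request a rate reduction. (Under AI communication patterns, All-to-All and All-Reduce combined with Cell slicing usually balance traffic as evenly as possible, and it is rare for a single port to be saturated, so in most cases ECN does not need to be configured.)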

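NCC-free design with a distributed OS for higher reliability

On the management and control plane, to eliminate the impact of management network failures and of the NCC as a single point of failure, we removed the centralized NCC control plane and built a distributed OS. An SDN operations controller configures and manages the devices through standard interfaces (NETCONF, gRPC, and others), and each NCP and NCF is managed independently, with its own control plane and management plane.

Test comparison results

In theory, DDC has many advantages: elastic expansion, rapid feature iteration, easier deployment, and low power consumption per device. In practice, traditional networking also has the advantages that come with mature technology, such as a wider choice of brands and product lines on the market and support for larger clusters. When deciding for a given project between the higher-performance DDC and traditional networking for larger-scale deployment, customers can refer to the following comparison and test results:

Figure: Test comparison of traditional networking and DDC

We also used the OpenMPI test suite to run a simulated comparison between a chassis device (which works on the same principle as DDC, so a chassis was used for this test) and traditional networking devices. The conclusion is that in the All-to-All scenario the chassis device improves bandwidth utilization by about 20% over traditional networking, corresponding to an improvement of about 8% in GPU utilization.

Figure: Simulated comparison test of a chassis device and traditional networking devices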

GA黄金甲 device introduction

Based on a deep understanding of customer needs, GA黄金甲 Networks has been among the first to release two deliverable products: a 200G NCP switch and a 200G NCF switch.

NCP: RG-S6930-36DC40F1 switch

This switch is 2U high and provides 36 200G front-panel ports, 40 200G fabric interconnect ports, 4 fans, and 2 power supplies.

NCF: RG-X56-96F1 switch

This switch is 4U high and provides 96 200G fabric interconnect ports, 8 fans, and 4 power supplies.

GA黄金甲 Networks will continue to develop and release products with 400G ports. Stay tuned.

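Conclusion

As an industry leader, GA黄金甲 Networks (stock code: 301165) has always been committed to providing high-quality, highly reliable network equipment and solutions to meet customers' growing demands on intelligent computing centers. Alongside the "智快" DDC solution, GA黄金甲 Networks is also actively exploring and developing end-network optimization for traditional networking, using server SmartNICs together with optimizations of network devices and protocols to raise bandwidth utilization across the whole network and help customers reach the AIGC intelligent computing era sooner.

References:

[1] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv:2104.04473v5 [cs.CL], 23 Aug 2021.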
