Published: 2023-03-20
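AIGC (AI-Generated Content) has been developing rapidly, and its pace of iteration has grown explosively, at an exponential rate. The launches of GPT-4 and ERNIE Bot (Wenxin Yiyan) in particular have drawn intense attention to the commercial value and application scenarios of this technology. As AIGC advances, the parameter count of trained models has grown from hundreds of billions to trillions, and the underlying GPU infrastructure has reached the scale of ten thousand cards. Networks keep growing as a result, and communication between network nodes faces ever greater challenges. Against this background, how to improve the compute capability of AI servers and the communication capability of the network while keeping cost under control has become one of the important research directions in artificial intelligence.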
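Addressing the relationship between AIGC compute power, GPU utilization, and the network, as well as the challenges facing mainstream HPC networking, GA黄金甲网络 has launched the industry-leading "Smart Speed" DDC (Distributed Disaggregated Chassis) high-performance network solution. It clears the key bottlenecks for AIGC services and helps compute power advance by leaps and bounds.
GA黄金甲网络 DDC product connection diagram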
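Take ChatGPT as an example. In terms of compute, it was trained on Microsoft's Azure AI supercomputing infrastructure (a high-bandwidth cluster of 10,000 V100 GPUs), and the total compute consumed was about 3640 PF-days (that is, 10^15 operations per second sustained for 3640 days). The following conversion estimates how long 10,000 V100 GPUs need to train: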

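ChatGPT compute and training time table
Note: the ChatGPT compute requirement was obtained from the Internet and is for reference only. In its article "AI and Compute", OpenAI assumes a utilization of 33%. A group of researchers from NVIDIA, Stanford, and Microsoft reached utilizations of 44% to 52% when training large language models on distributed systems.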

ChatGPT's reply about its training time
ChatGPT's reply is fairly consistent with the time calculated in the table above, which suggests a utilization of around 50%.
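The conversion from a PF-days budget to wall-clock training time can be sketched in a few lines of Python. This is a minimal sketch and not the article's original table: the per-GPU peak of 125 TFLOPS (the V100 Tensor Core peak) is an assumption, and the function and parameter names are illustrative.

```python
def training_days(total_pf_days, num_gpus, peak_tflops_per_gpu, utilization):
    """Estimate wall-clock training days for a fixed compute budget."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Cluster peak in PFLOP/s (1 PFLOP/s = 1000 TFLOP/s).
    cluster_peak_pf = num_gpus * peak_tflops_per_gpu / 1000.0
    # Only a fraction of the peak is actually achieved during training.
    effective_pf = cluster_peak_pf * utilization
    return total_pf_days / effective_pf


if __name__ == "__main__":
    # 125 TFLOPS per V100 is an assumed peak, not a figure from the article.
    for u in (0.33, 0.44, 0.52):
        days = training_days(3640, 10_000, 125.0, u)
        print(f"utilization {u:.0%}: about {days:.1f} days")
```

Under these assumptions the training time is inversely proportional to utilization, which is why the rest of the discussion focuses on what limits GPU utilization.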
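The main factors that determine how long a model takes to train are therefore GPU utilization and the processing capability of the GPU cluster, and both are closely tied to network efficiency. In an AI cluster, GPUs are usually the core resource of the compute nodes, because they handle large-scale deep learning workloads efficiently. GPU utilization, however, depends on several factors, and network efficiency is a key one.
The network plays a crucial role in AI training. An AI cluster usually consists of many compute nodes and storage nodes that must communicate and exchange data frequently. If the network is inefficient, communication between these nodes slows down, which directly reduces the compute power of the cluster.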
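An inefficient network can cause the following problems, each of which lowers GPU utilization:
Longer data transfer time: in an inefficient network, data takes longer to transfer. When a GPU has to wait for a transfer to finish before it can compute, its utilization drops;
Network bandwidth bottleneck: in an AI cluster, GPUs usually need to exchange data frequently with other compute nodes. If network bandwidth is insufficient, a GPU cannot obtain enough data to compute on, and its utilization drops;
Unbalanced task scheduling: in an inefficient network, tasks may be assigned to compute nodes other than those hosting the GPUs. When large amounts of data must then be transferred, GPUs may sit idle waiting, which lowers their utilization.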
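To raise GPU utilization, network efficiency must be optimized. This can be done by adopting faster network technologies, optimizing the network topology, and allocating bandwidth sensibly. In model training, the parallelism scheme of distributed training (data parallelism, tensor parallelism, and pipeline parallelism) determines the communication model among the data processed by the GPUs. Communication efficiency is affected by the following factors: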

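Factors that affect communication
Of these, bandwidth and device forwarding latency are limited by hardware; end-host processing latency depends on the technology chosen (TCP or RDMA), and is lower with RDMA; queuing and retransmission depend on network optimization and on the technology chosen.
Based on the quantitative model [1], GPU utilization = in-GPU iteration compute time / (in-GPU iteration compute time + overall network communication time), the following conclusions can be drawn: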

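Curve of bandwidth throughput versus GPU utilization; curve of dynamic latency versus GPU utilization
Network bandwidth throughput and dynamic latency (congestion and packet loss) both have a significant effect on GPU utilization.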
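The quantitative model can be tried out with a small Python sketch. The utilization formula is the one quoted from [1]; splitting communication time into transfer time plus static and dynamic latency is a simplifying assumption made here for illustration, and all names and sample numbers are hypothetical.

```python
def comm_time_s(bytes_per_iter, bandwidth_gbps,
                static_latency_s=0.0, dynamic_latency_s=0.0):
    """Assumed model: serialization time plus static and dynamic latency."""
    transfer_s = bytes_per_iter * 8 / (bandwidth_gbps * 1e9)
    return transfer_s + static_latency_s + dynamic_latency_s


def gpu_utilization(compute_s, comm_s):
    """GPU utilization = compute time / (compute time + communication time)."""
    return compute_s / (compute_s + comm_s)


if __name__ == "__main__":
    compute_s = 0.10            # hypothetical per-iteration compute time
    payload = 2 * 1024 ** 3     # hypothetical 2 GiB exchanged per iteration
    for bw in (100, 200, 400):
        for dyn in (0.0, 0.05):
            t = comm_time_s(payload, bw, static_latency_s=1e-5,
                            dynamic_latency_s=dyn)
            u = gpu_utilization(compute_s, t)
            print(f"{bw}G, dynamic latency {dyn * 1e3:.0f} ms: {u:.1%}")
```

In this sketch, raising bandwidth or removing the dynamic latency term both increase utilization, which matches the qualitative shape of the two curves.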
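Looking at how total communication latency is composed:

Composition of total communication latency
Static latency has a comparatively small impact, so the focus should be on reducing dynamic latency. Doing so effectively raises GPU utilization and thereby achieves the goal of increasing compute power.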
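InfiniBand networking currently delivers the best results for high-performance networks. It uses very high bandwidth and a credit-based mechanism to ensure a congestion-free network with ultra-low latency. It is also the most expensive option, costing several times as much as a traditional Ethernet network of the same bandwidth. InfiniBand is also a closed technology with only one mature supplier in the industry, so end users cannot secure a second source.
Most users in the industry therefore choose a traditional Ethernet networking solution.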
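The mainstream high-performance networking solution today builds an RDMA-capable network on RoCE v2. Its two important companion technologies are PFC and ECN, both of which were created to avoid congestion on links.
In a multi-tier PFC network, congestion at a switch ingress is back-pressured hop by hop to the source server, which pauses sending; this relieves congestion and avoids packet loss. In a multi-tier network, however, this approach carries the risk of PFC deadlock, which stops RDMA traffic from being forwarded.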
PFC working mechanism diagram
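The hop-by-hop back-pressure behaviour of PFC can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model of any real switch: the chain of three hops, the XOFF/XON thresholds, and the per-tick rates are all hypothetical values chosen so that the last hop's egress is the bottleneck.

```python
def simulate_pfc(hops=3, ticks=50, xoff=8, xon=4,
                 send_per_tick=2, drain_per_tick=1):
    """Toy model: source -> hop 0 -> ... -> hop (hops-1) -> congested egress.

    Returns (first_pause_tick, source_paused_ticks).
    """
    queues = [0] * hops
    paused = [False] * hops   # paused[i]: hop i is pausing its upstream neighbor
    first_pause_tick = [None] * hops
    source_paused_ticks = 0

    for tick in range(ticks):
        # Move packets, starting from the hop closest to the congestion point.
        for i in reversed(range(hops)):
            if i == hops - 1:
                out = min(queues[i], drain_per_tick)   # slow, congested egress
            elif paused[i + 1]:
                out = 0                                # downstream sent PAUSE
            else:
                out = min(queues[i], send_per_tick)
                queues[i + 1] += out
            queues[i] -= out

        # The source server obeys PAUSE frames from the first hop.
        if paused[0]:
            source_paused_ticks += 1
        else:
            queues[0] += send_per_tick

        # Each hop asserts PAUSE at the XOFF threshold and releases it at XON.
        for i in range(hops):
            if queues[i] >= xoff:
                paused[i] = True
                if first_pause_tick[i] is None:
                    first_pause_tick[i] = tick
            elif queues[i] <= xon:
                paused[i] = False

    return first_pause_tick, source_paused_ticks


if __name__ == "__main__":
    first, stalled = simulate_pfc()
    print("first PAUSE per hop (source side first):", first)
    print("ticks the source spent paused:", stalled)
```

In this toy model the hop nearest the congestion asserts PAUSE first and the pause then spreads upstream until the source stops sending, which is the propagation that the multi-tier deadlock risk builds on.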
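ECN, by contrast, relies on the destination detecting congestion at a switch egress. The destination directly generates a RoCEv2 CNP packet to tell the source to slow down. When the source server receives the CNP, it lowers the sending rate of exactly the corresponding QP, which relieves congestion while avoiding indiscriminate rate reduction.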

ECN flag bits diagram
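The per-QP reaction to a CNP can be sketched as follows. This is a deliberately simplified illustration and not the rate-control algorithm of any particular NIC: the class name, the fixed 50% cut, and the additive recovery step are assumptions made for this example.

```python
class RoceSender:
    """Tracks a sending rate per queue pair (QP) on one source server."""

    def __init__(self, line_rate_gbps):
        self.line_rate_gbps = line_rate_gbps
        self.qp_rate_gbps = {}

    def open_qp(self, qp_id):
        # A new QP starts at line rate.
        self.qp_rate_gbps[qp_id] = self.line_rate_gbps

    def on_cnp(self, qp_id, cut_factor=0.5):
        # Only the QP named in the CNP slows down (assumed multiplicative cut).
        self.qp_rate_gbps[qp_id] *= (1.0 - cut_factor)

    def on_recovery_timer(self, qp_id, step_gbps=10.0):
        # Without further CNPs the QP ramps back up (assumed additive step).
        new_rate = self.qp_rate_gbps[qp_id] + step_gbps
        self.qp_rate_gbps[qp_id] = min(self.line_rate_gbps, new_rate)


if __name__ == "__main__":
    sender = RoceSender(line_rate_gbps=200.0)
    sender.open_qp(1)
    sender.open_qp(2)
    sender.on_cnp(1)            # congestion reported for QP 1 only
    print(sender.qp_rate_gbps)  # QP 2 keeps its full rate
```

The point of the sketch is the scoping: the congested QP is throttled while unrelated QPs on the same server are left alone.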
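Neither technology is a problem in itself; both were created to deal with congestion. Once they are deployed, however, congestion in the network can trigger them frequently. The source then pauses or slows its sending, communication bandwidth drops, GPU utilization suffers considerably, and the compute power of the whole high-performance network is dragged down.
AI training involves two main communication models, All-Reduce and All-to-All. Both require frequent communication from one GPU to multiple other GPUs.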

All-to-All model; All-Reduce model
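For readers unfamiliar with the two collectives, their data movement can be shown in plain Python with lists standing in for GPUs. This is only a sketch of the semantics (what each rank ends up holding), not of how a communication library schedules the traffic; the function names are illustrative.

```python
def all_reduce_sum(rank_values):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(column) for column in zip(*rank_values)]
    return [list(total) for _ in rank_values]


def all_to_all(rank_chunks):
    """Rank dst receives chunk number dst from every source rank."""
    n = len(rank_chunks)
    return [[rank_chunks[src][dst] for src in range(n)] for dst in range(n)]


if __name__ == "__main__":
    grads = [[1, 2], [3, 4], [5, 6]]           # one gradient vector per rank
    print(all_reduce_sum(grads))

    chunks = [["a0", "a1"], ["b0", "b1"]]      # chunks[src][dst]
    print(all_to_all(chunks))
```

In both cases every rank exchanges data with many peers in the same step, which is why the traffic pattern is so sensitive to how evenly the network spreads load.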
In a traditional network, ToR and Leaf devices use routing plus ECMP. ECMP picks a path by hashing each flow, so in an extreme case a single elephant flow can fill one ECMP link while the other ECMP links stay relatively idle, leaving the load uneven.

Traditional ECMP deployment diagram
In an internal test environment simulating 8 ECMP links, the results were as follows:

ECMP traffic test results
Flow-based ECMP clearly leaves some links heavily occupied (ECMP1-5 and 1-6) and others idle (ECMP1-0 through 1-3 are relatively idle). Under the All-Reduce and All-to-All models, one path can easily become congested because of this uneven ECMP load. Once congestion causes retransmission, total communication latency rises and GPU utilization falls.
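Why per-flow hashing produces this imbalance can be seen in a short Python sketch. It is an illustration only: real switches use their own hash functions and fields, so the CRC32 hash, the 5-tuples, and the flow sizes below are all assumptions.

```python
import zlib


def ecmp_link(five_tuple, num_links):
    """Pick a link by hashing the flow's 5-tuple (CRC32 is an assumption)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % num_links


def link_loads(flows, num_links):
    """Sum the bytes of every flow onto the single link its hash selects."""
    loads = [0] * num_links
    for five_tuple, nbytes in flows:
        loads[ecmp_link(five_tuple, num_links)] += nbytes
    return loads


if __name__ == "__main__":
    # A few hypothetical long-lived RDMA elephant flows (UDP port 4791).
    flows = [
        (("10.0.0.%d" % i, "10.0.1.%d" % i, 49152 + i, 4791, "UDP"), 10 ** 9)
        for i in range(1, 9)
    ]
    loads = link_loads(flows, num_links=8)
    for link, nbytes in enumerate(loads):
        print(f"link {link}: {nbytes / 1e9:.0f} GB")
```

Because each flow is pinned to one link, a handful of large flows cannot be spread evenly: whenever two of them hash to the same link, that link carries double while another carries nothing.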
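To solve this class of problems, the research community has proposed a rich set of solutions such as pHost, Homa, NDP, 1RMA, and Aeolus. To varying degrees they address incast, load balancing, and low-latency request/response traffic. They also bring new challenges: these proposals usually need to solve the problem end to end, which requires substantial changes to hosts, NICs, and the network, and that is costly for ordinary users.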
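Some Internet companies overseas hope to use chassis switches built on DNX chips with VOQ support to solve the low bandwidth utilization caused by load imbalance, but they face the following challenges.
Limited scalability: the chassis size caps the maximum number of ports. A larger cluster requires scaling out across multiple chassis, which again produces multi-tier PFC and ECMP links, so a chassis only suits small-scale deployments;
High power consumption: a chassis contains many line card chips, fabric chips, and fans, so a single device draws a great deal of power, easily more than 20,000 W and in some cases more than 30,000 W, which places high demands on cabinet power;
Large failure domain: a single device carries many ports.
For these reasons, chassis devices are only suitable for small-scale AI compute clusters.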
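DDC is a solution that disaggregates the chassis into distributed devices. It uses almost the same chips and key technologies as a traditional chassis switch, but the DDC architecture is simple, supports elastic scaling and rapid feature iteration, is easier to deploy, and consumes little power per device.
As shown in the figure below, the service line cards become the front end in the NCP role, and the switch fabric boards become the back end in the NCF role. The connector components that used to sit between the two are replaced by optical fiber cables, and the management engine of the chassis becomes the NCC, a centralized or distributed management component in the DDC architecture.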

DDC product connection diagram
The advantage of the DDC architecture over the chassis architecture is elastic scalability: the network can be sized flexibly to match the AI cluster.
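In a single-POD network, 96 NCPs provide access. Each NCP has 36 downlink 200G interfaces, which connect to the NICs of the AI compute cluster, and 40 uplink 200G interfaces, which can connect to at most 40 NCFs. Each NCF provides 96 200G interfaces. At this scale the uplink-to-downlink speedup ratio is about 1.1:1. The whole POD supports 3456 200G network interfaces; counting 8 GPUs per server, it supports 432 AI compute servers.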

Single-POD network architecture diagram
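The single-POD figures can be re-derived with a few lines of Python. This is a sizing sketch using only the port counts quoted above; it assumes one 200G NIC port per GPU (so 8 ports per server), and the function and key names are illustrative.

```python
def single_pod(ncp_count=96, ncp_down_ports=36, ncp_up_ports=40,
               ports_per_server=8):
    """Derive single-POD capacity from the per-device port counts."""
    access_ports = ncp_count * ncp_down_ports
    return {
        "access_ports": access_ports,
        # Assumption: one 200G port per GPU, 8 GPUs per server.
        "servers": access_ports // ports_per_server,
        "speedup": ncp_up_ports / ncp_down_ports,
        # One uplink from every NCP to every NCF.
        "ncf_count": ncp_up_ports,
        "ports_used_per_ncf": ncp_count,
    }


if __name__ == "__main__":
    for name, value in single_pod().items():
        print(f"{name}: {value}")
```

With 96 NCPs each using one uplink per NCF, every NCF needs exactly 96 ports, which matches the 96-port NCF described above.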
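In a multi-tier POD network, capacity can be built out on demand, one POD at a time. In this scenario the NCFs inside a POD must give up half of their SerDes to connect to the second-tier NCFs, so a single POD uses 48 NCPs for access, each with 36 downlink 200G interfaces, and supports 1728 200G interfaces. Capacity is expanded by adding PODs horizontally, up to a total of 10368 200G network ports.
Each NCP uses its 40 uplink 200G interfaces to connect to the 40 NCFs in its POD. Each in-POD NCF uses 48 200G interfaces as downlinks, and its other 48 200G interfaces go up to the second-tier NCFs in groups of 16. The second-tier NCFs are organized into 40 planes of 3 devices each, with each plane corresponding to one of the 40 NCFs in a POD.
Within a POD the network has a speedup ratio of about 1.1:1, and between the PODs and the second-tier NCFs the convergence ratio is 1:1.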
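The multi-tier numbers follow from the same kind of arithmetic. The sketch below uses the quoted port counts; the step that derives the maximum number of PODs (a 96-port second-tier NCF divided by 16 links per POD) is an inference made here, not something the text states explicitly, and all names are illustrative.

```python
def multi_tier(ncp_per_pod=48, ncp_down_ports=36, ncf_per_pod=40,
               ncf_ports=96, uplink_group=16):
    """Derive multi-tier POD capacity from the per-device port counts."""
    pod_access_ports = ncp_per_pod * ncp_down_ports
    ncf_down = ncf_ports // 2            # half the SerDes face the NCPs
    ncf_up = ncf_ports - ncf_down        # the other half face tier 2
    tier2_per_plane = ncf_up // uplink_group
    # Inference: each tier-2 NCF spends one group of 16 ports per POD.
    max_pods = ncf_ports // uplink_group
    return {
        "pod_access_ports": pod_access_ports,
        "tier2_per_plane": tier2_per_plane,
        "tier2_total": ncf_per_pod * tier2_per_plane,
        "max_pods": max_pods,
        "total_access_ports": max_pods * pod_access_ports,
        "pod_to_tier2_ratio": ncf_down / ncf_up,
    }


if __name__ == "__main__":
    for name, value in multi_tier().items():
        print(f"{name}: {value}")
```

Under that inference, 6 PODs of 1728 ports each give the 10368 ports quoted above, and the 48 down / 48 up split on each in-POD NCF gives the 1:1 ratio toward the second tier.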
The 200G network ports also accept 100G NICs, and in special cases 1-to-2 or 1-to-4 breakout cables can be used to attach 25G/50G NICs.
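Dynamic load balancing relies on forwarding packets as cells after they are split. This keeps latency stable and reduces the difference in peak bandwidth between links.
The forwarding process is shown in the figure: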
Ê×ÏÈ·¢ËͶ˴ÓÍøÂçÖнӹÜÊý¾Ý°ü²¢·ÖÀൽVOQsÖд洢£¬ÔÚ·¢ËÍÊý¾Ý°ü֮ǰ»áÏÈ·¢ËÍCredit±¨ÎÄÈ·¶¨½Ó¹Ü¶ËÊÇ·ñÓÐ×ã¹»µÄ»º´æ¿Õ¼ä´¦ÖÃÕâЩ±¨ÎÄ£»
ÈôÊÇÄܹ»Ôò½«Êý¾Ý°ü·Ô쬳ÉCells²¢ÇÒ¶¯Ì¬¸ºÔØÆ½ºâµ½ÖÐÑëµÄFabric½Úµã¡£ÕâЩCellsÔÚ½Ó¹Ü¶Ë»á½øÐгÁ×éºÍ´æ´¢£¬½ø¶ø×ª·¢µ½ÍøÂçÖС£
Cells are slices of packets, typically 64 to 256 bytes in size.
Each Cell is forwarded according to a lookup of its cell destination in the reachability table, and Cells are sent in round-robin order. Compared with ECMP, which hashes each flow and pins it to a single path, the benefit is that the Cells make full use of every uplink, so the amount of data carried by all uplinks is approximately equal.
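Cell slicing and round-robin spraying can be sketched as follows. This is a simplified illustration, not the DNX implementation: it uses a fixed maximum cell size of 256 bytes (the upper end of the range quoted above), ignores cell headers and minimum cell size, and all names are hypothetical.

```python
def slice_into_cells(packet_len, cell_size=256):
    """Split one packet into full cells plus an optional shorter tail cell."""
    full, rest = divmod(packet_len, cell_size)
    return [cell_size] * full + ([rest] if rest else [])


def spray_cells(packet_lens, num_links, cell_size=256):
    """Send cells over the uplinks in round-robin order; return bytes per link."""
    loads = [0] * num_links
    next_link = 0
    for packet_len in packet_lens:
        for cell in slice_into_cells(packet_len, cell_size):
            loads[next_link] += cell
            next_link = (next_link + 1) % num_links
    return loads


if __name__ == "__main__":
    # Ten 4096-byte packets of a single flow over 8 uplinks.
    print(spray_cells([4096] * 10, num_links=8))
```

Even though all ten packets belong to one flow, every uplink carries the same number of bytes here, whereas per-flow hashing would have placed all of them on a single link.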
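If the receiving side is temporarily unable to handle packets, they are held in the VOQ on the sending side rather than being forwarded to the receiving side and dropped. Each DNX chip provides an on-chip OCB buffer plus 8 GB of off-chip HBM high-speed buffer, which for a 200G port is equivalent to buffering about 150 ms of data. Packets are sent only once the peer's Credit message confirms that they can be accepted. Under this mechanism, making full use of the buffers greatly reduces packet loss and can eliminate it altogether. With fewer retransmissions, overall communication latency is lower and more stable, which raises bandwidth utilization and in turn improves service throughput.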
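The credit handshake between a sending-side VOQ and the receiving side can be modelled in a few lines. This is a minimal sketch of the idea only (hold packets at the sender until the receiver grants credit); the class names, byte-granular credits, and buffer size are assumptions and do not describe the actual hardware protocol.

```python
from collections import deque


class EgressPort:
    """Receiving side: grants credit only while buffer space is available."""

    def __init__(self, buffer_bytes):
        self.capacity = buffer_bytes
        self.free = buffer_bytes
        self.received = []

    def request_credit(self, nbytes):
        if nbytes <= self.free:
            self.free -= nbytes      # reserve space for the granted packet
            return True
        return False

    def accept(self, nbytes):
        self.received.append(nbytes)

    def drain(self, nbytes):
        # Data leaves toward the server, freeing buffer space.
        self.free = min(self.capacity, self.free + nbytes)


class IngressVOQ:
    """Sending side: packets wait here instead of being dropped downstream."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, nbytes):
        self.queue.append(nbytes)

    def pump(self, egress):
        sent = 0
        while self.queue and egress.request_credit(self.queue[0]):
            egress.accept(self.queue.popleft())
            sent += 1
        return sent


if __name__ == "__main__":
    egress = EgressPort(buffer_bytes=3000)
    voq = IngressVOQ()
    for _ in range(3):
        voq.enqueue(1500)
    print("sent:", voq.pump(egress), "held in VOQ:", len(voq.queue))
    egress.drain(1500)
    print("sent after drain:", voq.pump(egress))
```

No packet is ever dropped in this model: a packet that cannot get credit simply stays in the VOQ until the receiving side has drained enough data.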
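Logically, all NCPs and NCFs in a DDC can be treated as a single device. Once an RDMA domain is deployed on this network, PFC exists at only one level, on the interfaces facing the servers, so the multi-tier PFC back-pressure and deadlock seen in traditional networks do not occur. In addition, given the DDC forwarding mechanism, ECN can be deployed on the interfaces: if the internal Credit and buffering mechanisms cannot absorb a traffic burst, CNP packets can be sent to the servers to request a lower rate. (Under the usual AI communication models, All-to-All and All-Reduce combined with Cell slicing balance traffic as evenly as possible, and it is rare for a single port to be saturated, so in most cases ECN does not need to be configured.)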
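On the management and control plane, to eliminate the impact of management network failures and of the NCC as a single point of failure, we removed the centralized NCC control plane and built a distributed OS. An SDN operations controller configures and manages the devices through standard interfaces (Netconf, gRPC, and so on). Each NCP and NCF is managed independently and has its own control plane and management plane.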
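In theory, DDC offers many advantages: elastic scaling, rapid feature iteration, easier deployment, and low power consumption per device. In practice, traditional networking has advantages of its own that come with technical maturity, such as a wider choice of brands and product lines on the market and support for larger clusters. When a customer has to decide for a given project between the higher-performance DDC and traditional networking for larger deployments, the following comparison and test results can serve as a reference: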

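Comparison of test results for traditional networking and DDC
We also used the OpenMPI test suite to run a comparative simulation test between a chassis device and traditional networking devices (a chassis device works on the same principle as DDC, so a chassis was used for this test). The conclusion is that in the All-to-All scenario, the chassis device improves bandwidth utilization by about 20% over traditional networking, which corresponds to an improvement of about 8% in GPU utilization.

Comparative simulation test of a chassis device and traditional networking devices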
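Based on a deep understanding of customer needs, GA黄金甲网络 has taken the lead in launching two deliverable products: a 200G NCP switch and a 200G NCF switch.
The 200G NCP switch is 2U high and provides 36 200G front-panel ports, 40 200G Fabric interconnect ports, 4 fans, and 2 power supplies.
The 200G NCF switch is 4U high and provides 96 200G Fabric interconnect ports, 8 fans, and 4 power supplies.
GA黄金甲网络 will continue to develop and launch products with 400G ports. Stay tuned.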
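As an industry leader, GA黄金甲网络 (stock code: 301165) has always been committed to providing high-quality, highly reliable network devices and solutions to meet customers' ever-growing requirements for intelligent computing centers. Alongside the "Smart Speed" DDC solution, GA黄金甲网络 is also actively exploring and developing end-network optimization solutions for traditional networking. By making full use of server SmartNICs together with optimizations in network devices and protocols, these solutions raise bandwidth utilization across the whole network and help customers enter the AIGC intelligent computing era sooner.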
References:
[1] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv:2104.04473v5 [cs.CL], 23 Aug 2021.
