
Reprinted from panewslab

01/26/2025

Silicon Valley Answers to AI 2025: 60 Key Insights

Source: Stone Study Notes

Editor's note:

At the end of 2024, a group of Chinese large-model companies launched new products, showing that AI is still hot. In Silicon Valley, after heated discussion, AI practitioners have reached some consensus about the AI industry in 2025, along with plenty of "non-consensus". For example, Silicon Valley investors believe that AI companies are a "new species" and that AI applications will be an investment hotspot in 2025.

From January 11th to 15th, Jinqiu Fund held a "Scale with AI" event in Silicon Valley, inviting A16Z, Pear VC, Soma Capital, Leonis Capital, Old Friendship Capital, OpenAI, xAI, Anthropic, Google, Meta, Microsoft, Apple, Tesla, Nvidia, Scale AI, Perplexity, Character.ai, Midjourney, Augment, Replit, Codeium, Limitless, Luma, Runway, and other companies.

After these exchanges, we summarized the experts' views into the 60 insights below.

01 Model

1. The pre-training stage of LLMs is approaching a bottleneck

But there are still many opportunities for post-training

In the pre-training stage, scaling is slowing down, and there is still some time before saturation.

Reasons for the slowdown, for single-modality models: architecture > compute > data.

But for multimodal models: data = compute > architecture.

For multimodal models, a combination of modalities has to be chosen. Under the existing architecture, pre-training can be considered essentially finished, but it could start over under a new architecture.

The reason there is less investment in pre-training now is mainly limited resources; the marginal benefit of post-training is higher.

2. The relationship between Pre-training and RL

Pre-training is not very sensitive to data quality.

Post-training has high data-quality requirements, but because of compute limits, the highest-quality data is used in the final stages.

Pre-training is imitation and can only imitate.

RL is creation and can do different things.

Pre-training comes first, then RL in post-training. The model must already have basic capabilities for RL to have something to target.

RL does not change the model's intelligence so much as its mode of thinking. For example, using RL to optimize engagement at C.AI works very well.

3. Large model optimization will affect product capabilities

This mainly happens in post-training, much of which is safety work, such as addressing the problem of child suicide. C.AI uses different models to serve different groups of people and different ages.

Next is the multi-agent framework: the model plans what needs to be done to solve a problem, assigns subtasks to different agents, and after each agent completes its task the results are combined and the final output is optimized.
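
A minimal sketch of the multi-agent pattern described above, under the assumption of a simple planner/worker/aggregator split; the functions are hypothetical stand-ins for LLM calls, not any company's actual framework:

```python
# Planner splits the problem into subtasks, worker agents handle them,
# and an aggregation step merges the results into the final output.
def plan(problem: str) -> list[str]:
    return [f"research: {problem}", f"draft answer: {problem}", f"review draft: {problem}"]

def run_worker(name: str, subtask: str) -> str:
    return f"[{name}] finished '{subtask}'"        # imagine a specialized model/agent call here

def aggregate(results: list[str]) -> str:
    return "\n".join(results)                      # a final model pass would polish this

def solve(problem: str) -> str:
    subtasks = plan(problem)
    workers = ["researcher", "writer", "reviewer"]
    results = [run_worker(w, t) for w, t in zip(workers, subtasks)]
    return aggregate(results)

print(solve("answer a user's question about travel insurance"))
```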

4. Some non-consensus may achieve consensus next year

Does everything need to be a large model? There were already many good small models, so there may be no need to build yet another large one.

What counts as a large model now will be a small model a year from now.

Model architecture may change. With the scaling law reaching its limit, future topics such as decoupling knowledge from the model may advance faster.

5. As the Scaling law comes to an end in the LLM field, the gap between closed source and open source is narrowing.

6. Video generation is still at the GPT-1/GPT-2 stage.

Current video generation quality is roughly at the level of SD 1.4. In the future there will be open-source video models with performance close to commercial products.

The current difficulty is the dataset. Image models rely on the LAION dataset, which everyone can clean; because of copyright and other issues, there are no comparably large public datasets for video. How each company obtains, processes, and cleans data differs greatly, which leads to different model capabilities and makes an open-source version correspondingly harder.

The next hard problem for the DiT approach is improving adherence to physical laws, not just statistical correlations.

Video generation efficiency is the sticking point: it currently takes a long time even on high-end GPUs, which is an obstacle to commercialization and an active topic in academia.

As with LLMs, although model iteration is slowing down, applications are not. From a product perspective, text-to-video alone is not a good direction; related editing and creative products will keep emerging, and there will be no bottleneck in the short term.

7. It will be a trend to choose different technology stacks for different scenarios.

When Sora came out, everyone thought the field would converge on DiT, but many technical paths are still being pursued: GAN-based approaches, real-time autoregressive generation (such as the recently popular Oasis project), and combinations of CG and CV to achieve better consistency and control. Each company makes different choices, and choosing different technology stacks for different scenarios will be a trend.

8. Video Scaling Law is far from LLM level

The scaling law for video holds within a certain range, but it is far from the LLM level. The largest models today are around 30B parameters, and scaling has been shown to work up to 30B; at the 300B level there are no successful cases yet.

The technical approaches have now converged and the methods do not differ much; the main differences are in data, including the data mix.

It will take 1-2 years for the DiT route to saturate, and there is a lot that could still be improved. A more efficient model architecture is very important. Take LLMs as an example: at first everyone built ever-larger models, but later found that with MoE and better data distribution, such large models were not necessary.

More research investment is needed; blindly scaling up DiT is very inefficient. Counting YouTube and TikTok, the amount of video data is enormous, and it is impossible to use all of it for training.

At this stage there is relatively little open-source work, especially in data preparation. Each company's cleaning methods differ greatly, and the data-preparation process has a big impact on the final result, so there are many points that can be optimized.

9. Methods to improve the speed of video generation

The simplest approach is to generate at lower resolution and lower frame rate. The most commonly used technique is step distillation: diffusion inference runs over multiple steps, and image generation currently needs at least two steps, so distilling it down to one-step inference makes it much faster. A recent paper even generates video in a single step; it is only a proof of concept for now, but worth watching.
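
As an illustration of step distillation, here is a minimal, hedged sketch in which a student network is trained to reproduce in one step what a teacher sampler does in two; the toy denoiser and the simplified update rule are assumptions for illustration, not a production recipe:

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion denoiser: maps (x_t, t) to a cleaner sample."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t):
        t_feat = t.expand(x_t.size(0), 1)                   # broadcast the timestep to the batch
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def teacher_two_steps(teacher, x_t, t, dt):
    """Two small denoising steps with the teacher (a simplified, DDIM-like update)."""
    x_mid = x_t + 0.5 * (teacher(x_t, t) - x_t)
    return x_mid + 0.5 * (teacher(x_mid, t - dt) - x_mid)

teacher, student = TinyDenoiser(), TinyDenoiser()
student.load_state_dict(teacher.state_dict())               # warm-start the student from the teacher
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for _ in range(100):                                        # toy distillation loop
    x_t = torch.randn(32, 64)                               # a batch of noisy samples
    t, dt = torch.tensor([[1.0]]), torch.tensor([[0.5]])
    with torch.no_grad():
        target = teacher_two_steps(teacher, x_t, t, dt)     # teacher output after TWO steps
    loss = ((student(x_t, t) - target) ** 2).mean()         # student matches it in ONE step
    opt.zero_grad(); loss.backward(); opt.step()
```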

10. Priority of video model iteration

In fact, clarity, consistency, controllability, and so on have not reached saturation, nor the point where improving one comes at the expense of another; all of them are still being improved simultaneously in the pre-training stage.

11. Technical solutions to speed up long video generation

You can see where the limits of DiT's capabilities are. The larger the model, the better the data, the higher the resolution, the longer the time, and the higher the success rate.

There is currently no answer to how far the DiT model can be scaled. If a bottleneck appears at a certain size, a new model architecture may emerge. On the algorithmic side, new inference algorithms have been developed for DiT to support fast generation; the harder part is how to incorporate these during training.

The current model's understanding of physical laws is in a statistical sense. The phenomena seen in the data set can be simulated to a certain extent, but it does not really understand physics. There are some discussions in the academic world, such as using some physical rules to generate videos.

12. Integration of video models and other modalities

There will be unification along two axes: unification of modalities, and unification of generation and understanding. For the former, representations must first be unified. For the latter, text and speech can already be unified. Unifying VLMs and diffusion is currently considered to deliver 1+1<2; this will be harder, not necessarily because the model is not smart enough, but because the two tasks are themselves in tension, and striking a delicate balance is a complex problem.

The simplest idea is to tokenize everything, feed it into one transformer, and unify the inputs and outputs. But my personal experience is that a single, modality-specific model works better than fusing everything together.
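
For concreteness, a minimal sketch of the "tokenize everything into one transformer" idea, with toy dimensions, a made-up patch projection, and a shared modality embedding; this is an assumption-laden illustration, not any lab's actual architecture:

```python
import torch
import torch.nn as nn

class UnifiedMultimodalToy(nn.Module):
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)             # text token ids -> vectors
        self.video_proj = nn.Linear(3 * 16 * 16, dim)          # flattened 16x16 RGB patches -> vectors
        self.modality_embed = nn.Embedding(2, dim)             # marks each token as text (0) or video (1)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)                      # unified (toy) output space

    def forward(self, text_ids, video_patches):
        t = self.text_embed(text_ids) + self.modality_embed(torch.zeros_like(text_ids))
        v_ids = torch.ones(video_patches.shape[:2], dtype=torch.long, device=video_patches.device)
        v = self.video_proj(video_patches) + self.modality_embed(v_ids)
        x = torch.cat([t, v], dim=1)                           # one shared token sequence
        return self.head(self.backbone(x))

model = UnifiedMultimodalToy()
text_ids = torch.randint(0, 1000, (2, 8))                      # batch of 2, 8 text tokens each
video_patches = torch.randn(2, 4, 3 * 16 * 16)                 # 4 flattened patches per sample
logits = model(text_ids, video_patches)                        # shape: (2, 12, vocab)
```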

In industrial practice, no one trains everything together yet, although a recent MIT paper suggests that unifying multiple modalities may actually work better.

13. There is actually a lot of training data for video modality.

There is actually a lot of video data, so how to efficiently select high-quality data is more important.

How much data is usable depends on how copyright is interpreted. But compute is also a bottleneck: even with that much data, there may not be enough compute to use it, especially high-definition data. Sometimes the required high-quality dataset has to be sized according to the compute on hand.

High-quality data has always been scarce, and even when data exists, the bigger problem is that no one knows what a correct caption looks like or which keywords it should contain.

14. The future of long video generation lies in storytelling

Current video generation produces raw material; the future is about stories, generation with a purpose. Long video is not about how long it is but about its storytelling, organized around tasks.

For video editing, the speed requirement is even higher, because the current sticking point is that generation is too slow: producing a few seconds of video takes minutes, so even a good algorithm is unusable. (Editing here does not mean cutting footage but content editing, such as changing people or actions; the technology exists, but it is too slow to be usable.)

15. The aesthetic improvement of video generation mainly relies on post training

It mainly relies on the post-training stage. Conch, for example, uses a lot of film and television data. Realism, by contrast, is a capability of the base model.

16. Two difficulties in video understanding are Long context and Latency.

17. The visual modality may not be the best modality to lead to AGI.

Text modality: text can also be turned into images and then into videos.

Text is the shortcut to intelligence, and the efficiency gap between video and text is hundreds of times

18. The end-to-end speech model is a great progress.

There is no need to manually label and judge the data, and fine emotional understanding and output can be achieved.

19. Multimodal models are still in a very early stage

Multimodal models are still at a very early stage. Predicting the next five seconds of video from its first second is already difficult, and adding text on top may be even harder.

In theory, it is best to train with video and text together, but it is difficult to implement it as a whole.

Multimodality cannot currently improve intelligence, but it may in the future. A compression algorithm can learn the relationships between datasets from pure text and pure image data alone; once learned, video and text can be understood in terms of each other.

20. The multi-modal technology path has not yet fully converged.

The Diffusion approach produces good quality, and its model structure is still being revised;

the autoregressive approach has good logic.

21. There is no consensus yet on the alignment of different modalities.

It has not been decided whether video should be represented as discrete or continuous tokens.

There is not much high-quality aligned data yet.

At present, we don’t know whether it is a scientific issue or an engineering issue.

22. It is feasible for a large model to generate data and then train a small model, but the reverse is more difficult.

The difference between synthetic data and real data is mainly a matter of quality.

Various types of data can also be pieced together synthetically, and the results are good. This works for the pre-training stage because its data-quality requirements are not high.

23. For LLM, the era of pre-training is basically over.

Now everyone is talking about Post training, which requires high data quality.

24. Post training team building

Theoretical team size: 5 people are enough (not necessarily full-time).

One person builds the pipeline (infrastructure).

One person manages data (data effects).

One person is responsible for SFT of the model itself (a scientist / paper reader).

One person is responsible for the product, making judgment calls on how models are orchestrated and collecting user data.

Products and UI in the AI era benefit from post-training advantages: AI makes up for gaps in product and UI understanding, enables rich development, and avoids being biased by AI.

25. Data pipeline construction

Data circulation: data enters the pipeline and new data is generated and returned.

Efficient iteration: data annotation combined with pipeline and AB testing, structured data warehouse.

Data input: Efficiently annotate and enrich user feedback to build a moat.

Initial stage: SFT (the process repeatedly loops back to this stage).

Subsequent stages: RL (differentiated into heavier RLHF), score-guided RL; the DPO method collapses easily and is essentially an SFT-style simplified version of RL.
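
Since DPO is named above, here is a hedged sketch of its standard loss (as formulated in the original DPO paper), written to make the "SFT-style simplified RL" point concrete; the log-probabilities are assumed to be sequence-level scores from the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward for each response: beta * (log pi_theta - log pi_ref)
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss on the margin between the preferred and rejected responses
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy numbers: log-probs of a chosen and a rejected answer under policy and reference.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-15.5]))
print(loss)
```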

02 Embodiment

1. Embodied robots have yet to usher in a “critical moment” similar to ChatGPT

A core reason is that robots need to complete tasks in the physical world, not just generate text through virtual language.

Breakthroughs in robot intelligence require solving the core problem of "embodied intelligence", that is, how to complete tasks in a dynamic and complex physical environment.

The "critical moment" of a robot needs to meet the following conditions: Versatility: able to adapt to different tasks and environments. Reliability: High success rate in the real world. Scalability: Able to continuously iterate and optimize through data and tasks.

2. The core problem solved by this generation of machine learning is generalization.

Generalization is the ability of an AI system to learn patterns from training data and apply them to unseen data.

There are two modes of generalization:

  • Interpolation: The test data is within the distribution range of the training data.

  • Extrapolation: the test data falls outside the training distribution. The difficulty lies in whether the training data can cover the test data well, and in the range and cost of that coverage. "Cover" or "coverage" is the key concept here: whether the training data effectively covers the diversity of the test data.
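
A toy illustration of the coverage idea behind interpolation versus extrapolation; the per-feature range check below is a deliberate simplification (real coverage analysis is much subtler), and the datasets are synthetic:

```python
import numpy as np

def coverage_fraction(train: np.ndarray, test: np.ndarray) -> float:
    """Fraction of test points lying inside the per-feature min/max range of the training data."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    inside = np.all((test >= lo) & (test <= hi), axis=1)   # within range on every feature?
    return float(inside.mean())

train = np.random.uniform(0.0, 1.0, size=(1000, 3))        # e.g. states a robot saw during training
test_interp = np.random.uniform(0.1, 0.9, size=(200, 3))   # new but in-distribution situations
test_extrap = np.random.uniform(0.5, 1.5, size=(200, 3))   # situations partly outside what was seen

print(coverage_fraction(train, test_interp))               # close to 1.0 -> mostly interpolation
print(coverage_fraction(train, test_extrap))               # much lower  -> mostly extrapolation
```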

3. Vision tasks (such as face recognition and object detection) are mostly interpolation problems.

The work of machine vision is mainly to imitate the perception ability of living things to understand and perceive the environment.

Machine vision models are already very mature for certain tasks (such as cat and dog recognition) because there is a large amount of relevant data to support them. However, for more complex or dynamic tasks, data diversity and coverage remain bottlenecks.

Vision tasks (such as face recognition, object detection) are mostly interpolation problems, and the model covers most test scenarios through training data.

But the model's capabilities are still limited when it comes to extrapolation problems, such as new angles or lighting conditions.

4. Difficulties in the generalization of this generation of robots: most situations belong to extrapolation situations

Environmental complexity: diversity and dynamics of domestic and industrial environments.

Physical interaction issues: physical properties such as door weight, angle difference, wear, etc.

Uncertainty in human-computer interaction: The unpredictability of human behavior places higher demands on robots.

5. Robots with fully human-like generalization capabilities may not be achievable in the current or future generation.

It is extremely difficult for robots to cope with the complexity and diversity of the real world. Dynamic changes in the real environment (such as pets, children, furniture placement, etc.) in the home make it difficult for robots to fully generalize.

Human beings themselves are not omnipotent individuals, but complete complex tasks in society through division of labor and cooperation. Robots also do not necessarily pursue "human-level" generalization capabilities, but focus more on certain specific tasks and even achieve "superhuman" performance (such as efficiency and precision in industrial production).

Even seemingly simple tasks (such as sweeping the floor or cooking) have very high generalization requirements due to the complexity and dynamics of the environment. For example, sweeping robots need to deal with the different layouts, obstacles, ground materials, etc. of thousands of households, which increase the difficulty of generalization.

So should robots pick their tasks? In other words, robots may need to focus on specific tasks rather than pursuing full human-level capability.

6. Stanford Lab’s Choice: Focus on Family Scenes

Stanford's Robotics Laboratory focuses on tasks in domestic scenarios, especially household robots related to an aging society. For example, robots can help complete daily tasks such as folding quilts, picking up items, and opening bottle caps.

Reasons for concern: Countries such as the United States, Western Europe, and China are all facing serious aging problems. Key challenges associated with aging include: Cognitive deterioration: Alzheimer’s disease is a widespread problem, affecting about half of people over the age of 95. Deterioration of motor function: Diseases such as Parkinson's disease and ALS make it difficult for older adults to perform basic daily tasks.

7. Define generalization conditions based on specific scenarios

Identify the environment and scenario that the robot needs to handle, such as a home, restaurant, or nursing home.

Once the scenarios are clear, you can better define the scope of the task and ensure that possible item state changes and environmental dynamics are covered in these scenarios.

The importance of scenario debugging: Debugging of robot products is not just about solving technical problems, but also covering all possible situations. For example, in nursing homes, robots need to handle a variety of complex situations (such as slow movement of the elderly, unstable placement of items, etc.). By working with domain experts (e.g., nursing home administrators, nursing staff), task requirements can be better defined and relevant data collected.

The environment in the real world is not completely controllable like an industrial assembly line, but it can be made "known" through debugging. For example, define the types, placement, dynamic changes, etc. of common objects in the home environment, covering the key points in simulation and real environments.

8. The contradiction between generalization and specialization

Conflict between general models and task-specific models: general models need strong generalization and the ability to adapt to diverse tasks and environments, but this usually requires large amounts of data and compute.

Task-specific models are easier to commercialize, but their capabilities are limited and difficult to expand to other fields.

Future robot intelligence needs to find a balance between generality and specialization. For example, through modular design, a common model becomes the basis, and then rapid adaptation is achieved through fine-tuning for specific tasks.

9. The potential of embodied multimodal models

Integration of multi-modal data: Multi-modal models can process multiple inputs such as vision, touch, and language at the same time, improving the robot's understanding and decision-making capabilities in complex scenes. For example, in a grasping task, visual data can help the robot identify the position and shape of the object, while tactile data can provide additional feedback to ensure the stability of the grasp.

The difficulty lies in how to efficiently integrate multi-modal data in the model. How to improve the adaptability of robots in dynamic environments through multi-modal data.

The importance of tactile data: Tactile data can provide additional information to the robot to help it complete tasks in complex environments. For example, when grasping flexible objects, tactile data can help the robot sense the deformation and force of the object.

10. Robot data closed loop is difficult to achieve

The field of robotics currently lacks iconic data sets like ImageNet, making it difficult for research to form unified evaluation standards.

Data collection is expensive, especially when it comes to real-world interaction data. For example, collecting multi-modal data such as tactile, visual, and dynamic data requires complex hardware and environmental support.

The simulator is considered an important tool to solve data closed-loop problems, but the "Sim-to-Real Gap" between simulation and the real world is still significant.

11. Challenge of Sim-to-Real Gap

There are gaps between the simulator and the real world in aspects such as visual rendering and physical modeling (such as friction, material properties). Robots perform well in simulation environments but may fail in real environments. This gap limits the direct application of simulation data.

12. Advantages and challenges of real data

Real data more accurately reflects the complexity of the physical world, but is expensive to collect. Data annotation is a bottleneck, especially when it comes to multimodal data (e.g., tactile, visual, dynamic).

The industrial environment is more standardized and the mission objectives are clearer, which is suitable for the early deployment of robotic technology. For example, in the construction of solar power plants, robots can complete repetitive tasks such as piling, installing panels, and tightening screws. Industrial robots can gradually improve model capabilities through data collection on specific tasks and form a closed loop of data.

13. In robot operation, tactile and force data can provide key feedback information

In robotic operation, tactile and force data can provide critical feedback information, especially during continuous tasks such as grasping and placing.

Form of tactile data: Tactile data is usually time series data, which can reflect the mechanical changes when the robot comes into contact with the object.

The latest research work is to add touch to large models.

14. Advantages of simulation data

The simulator can quickly generate large-scale data and is suitable for early model training and verification. Simulation data is low-cost to generate and can cover a variety of scenarios and tasks in a short time. In the field of industrial robots, simulators have been widely used to train tasks such as grasping and handling.

Limitations of simulation data: The physical modeling accuracy of the simulator is limited. For example, it cannot accurately simulate the material, friction, flexibility and other characteristics of the object. The visual rendering quality of simulation environments is often insufficient, which can result in models performing poorly in real environments.

15. Data simulation: Stanford launched a behavior simulation platform

Behavior is a simulation platform centered on home scenarios, supporting 1,000 tasks and 50 different scenarios, covering a variety of environments from ordinary apartments to five-star hotels.

The platform contains more than 10,000 objects, and through high-precision 3D models and interactive annotations, the physical and semantic properties of the objects (such as cabinet doors can be opened, clothes can be folded, glasses can be broken, etc.) are reproduced.

To ensure the realism of the simulation environment, the team invested a lot of manpower (for example, annotation by doctoral students) to carefully label each object's physical properties (mass, friction, texture, etc.) and interactive properties (whether it is detachable, whether it deforms, and so on). Other examples include labeling the flexible properties of clothes to support folding tasks, or labeling how plants look after being watered.

The Behavior project not only provides a fixed simulation environment, but also allows users to upload their own scenes and objects, annotate and configure them through the annotation pipeline.

At present, simulation can cover about 80% of pre-training; the remaining 20% needs to be supplemented by data collection and debugging in the real environment.

16. Application of hybrid model

Preliminary training is carried out through simulation data, and then fine-tuning and optimization are carried out through real data. Attempts have been made to scan real scenes into the simulator, allowing the robot to interact and learn in the simulation environment, thus reducing the Sim-to-Real Gap.
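
A minimal sketch of this sim-then-real recipe under simple behavior-cloning assumptions; the datasets, network, and hyperparameters below are hypothetical placeholders rather than any lab's actual setup, and the Sim-to-Real caveats above still apply:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

policy = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 7))  # observation -> 7-DoF action
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()

sim_data = TensorDataset(torch.randn(5000, 32), torch.randn(5000, 7))     # large, cheap simulator rollouts
real_data = TensorDataset(torch.randn(200, 32), torch.randn(200, 7))      # small, expensive real-robot data

def train(dataset, epochs, lr):
    for group in opt.param_groups:
        group["lr"] = lr
    for _ in range(epochs):
        for obs, act in DataLoader(dataset, batch_size=64, shuffle=True):
            loss = loss_fn(policy(obs), act)                               # behavior-cloning objective
            opt.zero_grad(); loss.backward(); opt.step()

train(sim_data, epochs=10, lr=3e-4)   # stage 1: pre-train in simulation
train(real_data, epochs=5, lr=1e-4)   # stage 2: fine-tune on real data at a lower learning rate
```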

17. Challenges of robot data sharing

Data is a company's core asset, and companies are reluctant to share data easily. There is a lack of unified data sharing mechanism and incentive mechanism.

Possible solutions:

Data exchange: The ability for mission-specific companies to contribute data in exchange for a common model.

Data intermediaries: Establish third-party platforms to collect, integrate and distribute data while protecting privacy.

Model sharing: Reduce dependence on original data through API or model fine-tuning.

There are already some companies trying these three approaches.

18. Choice of dexterous hands and grippers

Advantages of dexterous hands: high degree of freedom and ability to complete more complex tasks. Dexterous hands can compensate for inaccuracies in model predictions through adjustments with multiple degrees of freedom.

Advantages of grippers: low cost, suitable for specific tasks in industrial scenarios. Performs well on assembly line material handling tasks but lacks generalization capabilities.

19. Co-evolution of embodied robot software and hardware

The hardware platform and software model need to iterate together. For example, improved sensor accuracy on the hardware side provides higher-quality data to the model. Different companies have different strategies for software-hardware co-design.

03 AI application investment

1. Silicon Valley VCs believe that 2025 will be a big year for investment in AI applications.

VCs in Silicon Valley tend to believe that 2025 will be a big opportunity for application investment. There are basically no killer apps for everyone in the United States. Everyone is used to using apps with different functions in different scenarios. The key is to make the user experience as barrier-free as possible.

Last year, almost no attention was paid to application companies. Everyone was looking at LLM and Foundation models.

Investing in applications, VCs will ask, what's your moat?

One criterion Silicon Valley investors use for AI products: it is best to focus on a single direction that is hard for competitors to copy, and there needs to be some moat, whether network effects, insights that are hard to replicate, a technical edge that is hard to replicate, or capital resources others cannot obtain. Otherwise it is hard to call it a startup; it is more like a business.

2. Silicon Valley VCs believe that AI product companies are a new species

As a new species, AI companies are very different from earlier SaaS companies: after finding PMF, their revenue ramps very quickly. The real value creation, before the hype, happens at the seed stage.

3. The niche view among VCs is that they can consider investing in Chinese entrepreneurs if conditions permit.

The reason is: the new generation of Chinese founders are very energetic and capable of developing good business models.

But the premise is that the company is based in the United States.

China and Chinese entrepreneurs are making many new attempts, but international investors are wary of them and do not understand them; the minority view is that this is precisely where the value lies.

4. Silicon Valley VCs are trying to figure out how to establish their own investment strategies

Soma Capital: build connections with the best people and let them introduce their friends, creating lifelong friendships; in the process, inspire, support, and connect these people. Build a panoramic map, including market segmentation and project mapping, aiming to invest in a data-driven way. They invest from Seed to Series C and observe success/failure samples.

Leonis Capital: Research-driven venture capital fund, primarily First Check.

OldFriendship Capital: work first, invest later. They work with the founder first, conduct customer interviews, set interview guidelines, and figure out product issues together, similar to consulting work. For Chinese projects, working together lets them judge whether the Chinese founder can succeed with US customers.

Storm Venture: likes "unlocking growth" and prefers Series A companies with PMF. These usually have $1-2M in revenue, and the question is whether there is unlocked growth that can carry them to $20M. The core of B2B SaaS is wages, so it only applies in scenarios where labor costs are very high; the biggest enterprise-level opportunity is automating work.

Inference Venture: A $50 million fund that believes barriers are built on interpersonal relationships and domain knowledge.

5. Silicon Valley VCs believe the bar for an MVP has risen in the AI era.

Engineering, fintech, HR, and similar functions are the AI product directions where the underlying labor is most expensive.

White-collar labor is expensive, around $40 an hour, and only about 25% of that time is spent on actual work; middle managers may be eliminated in the future.

The fields with the most expensive labor are generally the ones most easily penetrated by AI. Hospital operators, by contrast, are mostly not Americans and may earn less than $2 an hour, so it is hard for AI to compete there on cost.

There will be a shift from "Service as Software" to AI Agents.

6. 5 AI predictions for 2025 from Leonis Capital, founded by OpenAI researchers

There will be an AI programming application that becomes popular.

Model providers begin to control costs: entrepreneurs need to choose models/agents to create a unique offering.

Cost-per-action pricing will appear.

Data centers will cause power shocks and may require new architectures; under the new architecture, models get smaller.

Multi-agent systems will become more mainstream.

7. AI native startup company standards

On competing with big companies: startups lack money, but their organizational structure differs from traditional SaaS companies. Incumbents such as Notion and Canva suffer more when adding AI, because Notion does not want to damage its core functionality.

AI-native companies have relatively low customer-acquisition costs, and the ROI of AI products is relatively clear. Scaling an AI company does not require hiring many people: $50 million in revenue may take only 20 people.

The moat lies in model architecture and customization.

8. Large-model companies emphasize pre-training, while application companies pay more attention to reasoning.

Each industry has its fixed ways of looking at problems, that is, its own unique cognitive architecture. The newly emerged AI agents add a cognitive architecture on top of the LLM.

9. Reasoning and reward design for consumer AI applications

For AI applications in everyday-life domains, reasoning can be framed as understanding intent.

Rewards are hard to read in these domains, whereas math and coding are easy to reward.

Topic timeliness and geographic location also need to be considered.

Only dynamic rewards are feasible, applied within groups of similar users.

10. AI-generated content is not very realistic and may become a new form of content.

For example, videos of cats walking and cooking.

04 AI Coding

1. Possible ideas for AI Coding company model training

One possible approach: initially use the better APIs from model companies to get better results, even if the cost is higher; then, after accumulating customer usage data, keep training your own small models for narrow scenarios, gradually replacing some API calls to achieve comparable results at lower cost.
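
A minimal sketch of that flywheel under obvious assumptions: `call_frontier_api`, the JSONL log path, and the accept signal are hypothetical placeholders, and the final fine-tuning step is only shown as assembling the training pairs:

```python
import json

def call_frontier_api(prompt: str) -> str:
    """Placeholder for a call to a large commercial model's API."""
    return "def add(a, b):\n    return a + b"   # imagine the real completion here

def log_interaction(prompt: str, completion: str, accepted: bool, path="usage_log.jsonl"):
    # Every accepted completion becomes a candidate training pair for the small model.
    with open(path, "a") as f:
        f.write(json.dumps({"prompt": prompt, "completion": completion, "accepted": accepted}) + "\n")

def build_finetune_set(path="usage_log.jsonl"):
    # Keep only completions users accepted in a narrow scenario (e.g. one language, one repo type).
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    return [{"prompt": r["prompt"], "completion": r["completion"]} for r in rows if r["accepted"]]

prompt = "Write a Python function that adds two numbers."
completion = call_frontier_api(prompt)
log_interaction(prompt, completion, accepted=True)
print(len(build_finetune_set()))   # this set would then fine-tune an in-house small model
```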

2. The difference between Copilot and Agent modes

The main difference is asynchrony: how asynchronously the AI assistant performs its tasks. Copilots usually require immediate user interaction and feedback, whereas agents can work more independently for longer periods before seeking user input. For example, code completion and code chat tools require users to watch and respond in real time, while agents can perform tasks asynchronously with less feedback, allowing them to accomplish more.

Initially the agent was designed to work independently for a long time (10-20 minutes) before providing results. However, user feedback shows that they prefer more control and frequent interactions. The agent is therefore tuned to work for a short period of time (a few minutes) before asking for feedback, striking a balance between autonomy and user engagement.
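
A minimal sketch of that autonomy/feedback trade-off: the agent works in short autonomous bursts and checks in with the user between bursts. `plan_next_step` and `apply_step` are hypothetical stand-ins for a real coding agent's planning and editing calls:

```python
import time

def plan_next_step(task: str, history: list[str]) -> str:
    return f"step {len(history) + 1} toward: {task}"        # imagine an LLM planning call here

def apply_step(step: str) -> str:
    return f"applied {step}"                                 # imagine edits to the repository here

def run_agent(task: str, burst_seconds: float = 120.0, max_bursts: int = 5) -> list[str]:
    history: list[str] = []
    for _ in range(max_bursts):
        deadline = time.monotonic() + burst_seconds          # autonomous budget: a few minutes, not 10-20
        while time.monotonic() < deadline:
            history.append(apply_step(plan_next_step(task, history)))
            break                                            # toy: one step per burst
        answer = input(f"Done so far: {history[-1]}. Continue? [y/n] ")
        if answer.strip().lower() != "y":                    # frequent, lightweight user checkpoints
            break
    return history
```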

Challenges in developing fully autonomous agents: two major obstacles hinder fully autonomous coding agents. The technology is not yet good enough to handle complex, long-horizon tasks without failing, which frustrates users; and users are still getting used to the idea of AI assistants making sweeping changes across multiple files or repositories.

3. Core challenges and improvements of Coding Agent

Key areas requiring further development include: (1) event modeling; (2) memory and world modeling; (3) accurate planning; (4) better context utilization, especially for long contexts (utilization drops significantly beyond 10,000 tokens). To enhance reasoning over extended memory lengths (e.g., 100,000 tokens or more), ongoing research aims to improve memory and reasoning for longer contexts.

Although world modeling may seem unrelated to coding agents, it plays an important role in solving common problems such as inaccurate planning. Solving world modeling challenges improves the coding agent's ability to make more efficient and accurate plans.

4. An important trend in AI coding is the use of inference-time enhancement techniques, similar to the o1 or o3 approach

Such methods can significantly improve the overall effectiveness of coding agents. Although they currently cost much more (10-100x), they can cut the error rate in half or even to a quarter. As language models advance, these costs are expected to fall rapidly, which may make this approach a common technical route.
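
One common, concrete form of this idea is best-of-n sampling with a verifier; the sketch below spends roughly n times the compute of a single attempt and keeps the candidate that a test run scores best. `generate_candidate` and `run_tests` are hypothetical placeholders, not a description of o1/o3 internals:

```python
import random

def generate_candidate(task: str, temperature: float) -> str:
    return f"patch for '{task}' (t={temperature:.2f})"        # imagine an LLM sampling a code patch

def run_tests(candidate: str) -> float:
    return random.random()                                     # imagine the fraction of tests that pass

def best_of_n(task: str, n: int = 16) -> str:
    # n candidate patches -> roughly n times the cost, in exchange for a lower error rate.
    candidates = [generate_candidate(task, temperature=0.8) for _ in range(n)]
    best_score, best_patch = max((run_tests(c), c) for c in candidates)
    return best_patch

print(best_of_n("fix the failing date parser", n=8))
```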

o3 performed significantly better than other models in benchmark tests, including the "Total Forces" test. The current industry score is generally around 50 points, while o3 scores 70-75.

SMV scores have improved rapidly over the past few months: a few months ago scores were in the 30s, and now they are in the 50s.

Model performance enhancement: applying advanced techniques can further raise the score to approximately 62 points, according to internal testing, and using o3 can push it to 74-75. While these enhancements may significantly increase cost, the performance gains are substantial.

User experience and latency thresholds: Determining the best balance between performance and user experience can be challenging. For the autocomplete feature, response times exceeding 215-500 milliseconds may cause users to disable the feature. In chat applications, a response time of a few seconds is usually acceptable, but waiting 50-75 minutes is not practical. The threshold for acceptable latency varies by application and user expectations.

Two major barriers to maximizing model quality are computational power requirements and the associated costs.

5. GitHub Copilot is considered a major competitor.

6. Customer success is crucial to adopting AI coding tools.

After-sales support, training, launch and adoption are key differentiators. A startup has 60-70 people dedicated to customer success, which is about half of its total workforce. This is a big investment but helps ensure customer satisfaction.
